Refer to previous row in data.table in R, with a condition

I have a new problem with this data, because my full data looks like this:
a=data.table(A=c(1:10),B=c(1,2,0,2,0,0,3,4,0,2),C=c(2,3,1,4,5,3,6,7,2,2),d=c(1,1,1,1,1,2,2,2,2,2))
# A B C d
# 1: 1 1 2 1
# 2: 2 2 3 1
# 3: 3 0 1 1
# 4: 4 2 4 1
# 5: 5 0 5 1
# 6: 6 0 3 2
# 7: 7 3 6 2
# 8: 8 4 7 2
# 9: 9 0 2 2
#10: 10 2 2 2
Now I want to create a new column D that multiplies A by B/C taken from the closest previous row in which B is not 0. For example, in row 2, D = 2*(1/2). In row 4, however, it has to be 4*(2/3); it cannot be 4*(0/1).
I use
a[, D:= {i1 <- (NA^!B)
list( A*shift(na.locf(i1*B))/shift(na.locf(i1*C)))},by=d]
as Akrun recommended yesterday. It does not work when I calculate it by group. The result looks like this:
A B C d D
# 1: 1 1 2 1 NA
# 2: 2 2 3 1 1.000000
# 3: 3 0 1 1 2.000000
# 4: 4 2 4 1 2.666667
# 5: 5 0 5 1 2.500000
# 6: 6 0 3 2 NA
# 7: 7 3 6 2 3.500000
# 8: 8 4 7 2 4.571429
# 9: 9 0 2 2 5.142857
# 10: 10 2 2 2 NA
Does anyone know what the problem is here? The error is "longer object length is not a multiple of shorter object length".

We can replace the elements in 'B' and 'C' that correspond to a '0' value in 'B' with NA. Use na.locf from zoo to replace those NA values with the previous non-NA elements, shift the elements (by default, shift gives a lag of 1), divide the modified column 'B' by 'C', and then multiply by 'A'. Assign (:=) the output to a new column 'D'.
library(zoo)
a[B==0, c('B', 'C'):=list(NA, NA)]
a[, c('B', 'C'):= na.locf(.SD), .SDcols=B:C]
a[, D:= {tmp <- shift(.SD[, 2:3, with=FALSE])
A*(tmp[[1]]/tmp[[2]])}]
Or we can make it compact. We get a logical vector (!B) that checks for '0' elements in 'B', convert that to a vector of 1s and NA (NA^), multiply with columns 'B' and 'C' so that the 1s are replaced by the corresponding elements in those columns whereas NA remains as such. Do the na.locf (as before), shift and then do the multiplication/division.
a[, D:= {i1 <- (NA^!B)
list( A*shift(na.locf(i1*B))/shift(na.locf(i1*C)))}]
Or instead of calling shift/na.locf two times
a[, D:= {i1 <- (NA^!B)
tmp <- shift(na.locf(i1*.SD))
a[['A']]*(tmp[[1]]/tmp[[2]])}, .SDcols=B:C]
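For the grouped case in the question, a likely cause of the length mismatch is that na.locf drops leading NAs by default (na.rm = TRUE), so a group whose first 'B' is 0 returns a vector shorter than 'A'. A minimal sketch, assuming the grouping column is called d as in the question's by=d call, keeps the leading NAs with na.rm = FALSE:

```r
library(data.table)
library(zoo)

a <- data.table(A = 1:10,
                B = c(1, 2, 0, 2, 0, 0, 3, 4, 0, 2),
                C = c(2, 3, 1, 4, 5, 3, 6, 7, 2, 2),
                d = rep(1:2, each = 5))   # d is the assumed group column

a[, D := {
  i1 <- NA^(!B)                                # 1 where B != 0, NA where B == 0
  b_prev <- shift(na.locf(i1 * B, na.rm = FALSE))  # keep leading NAs so lengths match
  c_prev <- shift(na.locf(i1 * C, na.rm = FALSE))
  A * b_prev / c_prev
}, by = d]
a
```

With this, the first rows of the second group stay NA because there is no earlier non-zero 'B' inside that group to refer to.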

This can be done with a rolling join:
a[, row := .I]
a[, B/C, by=row][V1 != 0][a, A*shift(V1), on="row", roll=TRUE]
# [1] NA 1.000000 2.000000 2.666667 2.500000 3.000000 3.500000 4.000000
# [9] 5.142857 5.714286

Related

replace row values based on another row value in a data.table

I have a trivial question, though I am struggling to find a simple answer. I have a data table that looks something like this:
dt <- data.table(id = rep(c("A", "B", "C"), each = 4), time = rep(1:4, 3), score = c(NA, 10, 15, 13, NA, 25, NA, NA, 18, 29, NA, 19))
dt
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 NA
# 8: B 4 NA
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
I would like to replace the missing values of my group "B" with the values of "A" at the same time point.
The final dataset should look something like this
final
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 15
# 8: B 4 13
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
In other words, wherever the score of "B" is NA, I would like to substitute the score of "A". Do note that the NA in "C" remains.
I am struggling to find a clean way to do this using data.table. However, if it is simpler with other methods, that would still be OK.
Thanks a lot for your help.
Here is one option where we get the index of the rows which are NA for 'score' and the 'id' is "B", use that to replace the NA with the corresponding 'score' value from 'A'
library(data.table)
i1 <- setDT(dt)[id == 'B', which(is.na(score))]
dt[, score:= replace(score, id == 'B' & is.na(score), score[which(id == 'A')[i1]])]
Or a similar option in dplyr
library(dplyr)
dt %>%
mutate(score = replace(score, id == "B" & is.na(score),
score[which(id == "A")[i1]]))
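If you would rather match on 'time' than rely on row positions, an update join is another option. This is only a sketch: a_scores and a_score are names made up for illustration, and it assumes every 'time' in "B" also exists in "A".

```r
library(data.table)

dt <- data.table(id = rep(c("A", "B", "C"), each = 4),
                 time = rep(1:4, 3),
                 score = c(NA, 10, 15, 13, NA, 25, NA, NA, 18, 29, NA, 19))

# Lookup table with A's score per time point (hypothetical helper table)
a_scores <- dt[id == "A", .(time, a_score = score)]

# Update join on time: only missing B scores are overwritten, everything else is kept
dt[a_scores, on = "time",
   score := fifelse(id == "B" & is.na(score), i.a_score, score)]
dt
```

B at time 1 stays NA because A is also NA there, and the NA in "C" is left untouched.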

How to replace a certain value in one data.table with values of another data.table of same dimension

Given two data.table:
dt1 <- data.table(id = c(1,-99,2,2,-99), a = c(2,1,-99,-99,3), b = c(5,3,3,2,5), c = c(-99,-99,-99,2,5))
dt2 <- data.table(id = c(2,3,1,4,3),a = c(6,4,3,2,6), b = c(3,7,8,8,3), c = c(2,2,4,3,2))
> dt1
id a b c
1: 1 2 5 -99
2: -99 1 3 -99
3: 2 -99 3 -99
4: 2 -99 2 2
5: -99 3 5 5
> dt2
id a b c
1: 2 6 3 2
2: 3 4 7 2
3: 1 3 8 4
4: 4 2 8 3
5: 3 6 3 2
How can one replace the -99 of dt1 with the values of dt2?
Wanted results should be dt3:
> dt3
id a b c
1: 1 2 5 2
2: 3 1 3 2
3: 2 3 3 4
4: 2 2 2 2
5: 3 3 5 5
You can do the following:
dt3 <- as.data.frame(dt1)
dt2 <- as.data.frame(dt2)
dt3[dt3 == -99] <- dt2[dt3 == -99]
dt3
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
If your data is all of the same type (as in your example), then converting both tables to matrices is a lot faster and more transparent:
dt1a <- as.matrix(dt1) ## convert to matrix
dt2a <- as.matrix(dt2)
# make a matrix of the same shape to access the right entries
missing_idx <- dt1a == -99
dt1a[missing_idx] <- dt2a[missing_idx] ## overwrite only the flagged entries
This is a vectorized operation, so it should be fast.
Note: If you do this make sure the two data sources match exactly in shape and order of rows/columns. If they don't then you need to join by the relevant keys and pick the correct columns.
EDIT: The conversion to matrix may be unnecessary. See kath's answer for a more terse solution.
A simple way is to use the setDF function to convert to data.frame, use data.frame subsetting methods, and restore to data.table at the end.
# Change to data.frame
setDF(dt1)
setDF(dt2)
# Perform assignment
dt1[dt1==-99] = dt2[dt1==-99]
# Restore back to data.table
setDT(dt1)
setDT(dt2)
dt1
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
This simple trick would work efficiently.
dt1<-as.matrix(dt1)
dt2<-as.matrix(dt2)
index.replace = dt1==-99
dt1[index.replace] = dt2[index.replace]
dt1 <- as.data.table(dt1)
dt2 <- as.data.table(dt2)
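This should also work as a simple approach, provided dt1 and dt2 are data.frames (for example after setDF), since it uses data.frame-style [i, j] indexing: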
for (i in 1:nrow(dt1)){
for (j in 1:ncol(dt1)){
if (dt1[i,j] == -99) dt1[i,j] = dt2[i,j]
}
}
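If dt1 and dt2 are still data.tables, the nested loop above will most likely not behave as intended, because a bare loop variable in j is not treated as a column index. A hedged alternative that stays in data.table is to loop over columns only and update by reference with set(); it assumes both tables have the same shape, row order, and column order:

```r
library(data.table)

dt1 <- data.table(id = c(1, -99, 2, 2, -99), a = c(2, 1, -99, -99, 3),
                  b = c(5, 3, 3, 2, 5), c = c(-99, -99, -99, 2, 5))
dt2 <- data.table(id = c(2, 3, 1, 4, 3), a = c(6, 4, 3, 2, 6),
                  b = c(3, 7, 8, 8, 3), c = c(2, 2, 4, 3, 2))

for (j in seq_along(dt1)) {
  idx <- which(dt1[[j]] == -99)          # rows to replace in column j
  if (length(idx)) {
    set(dt1, i = idx, j = j, value = dt2[[j]][idx])  # in-place update
  }
}
dt1
```

This does one vectorised assignment per column instead of one per cell.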

Reshaping data.table with cumulative sum

I want to reshape a data.table, and include the historic (cumulative summed) information for each variable. The No variable indicates the chronological order of measurements for object ID. At each measurement additional information is found. I want to aggregate the known information at each timestamp No for object ID.
Let me demonstrate with an example:
For the following data.table:
df <- data.table(ID=c(1,1,1,2,2,2,2),
No=c(1,2,3,1,2,3,4),
Variable=c('a','b', 'a', 'c', 'a', 'a', 'b'),
Value=c(2,1,3,3,2,1,5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
So the summed values of Value, per Variable by (ID, No), cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important: there are also Value entries of 0, which mean something different from NA.
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to duplicate each row TotalNo - No + 1 times, where TotalNo is the maximum No per ID. Then the original dcast function can be used to extract the data.table. So in code:
df[,TotalNo := .N, by=ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))] #create duplicates
df3 <- df2[order(ID, No)]#, No:= seq_len(.N), by=.(ID, No)]
df3[,No:= seq(from=No[1], to=TotalNo[1], by=1), by=.(ID, No)]
df4<- dcast(df3,
formula = ID + No ~ Variable,
value.var = "Value", fill=NA, fun.aggregate = sum)
It is not really nice, because the creation of duplicates uses more memory. I think it can be further optimized, but so far it works for my purposes. In the sample code it goes from 7 rows to 16 rows, in the original file from 241,670 rows to a whopping 978,331. That is more than four times as many rows.
SOLUTION
Eddi has improved the computing time of my solution on the full dataset (2.08 seconds for Eddi's version versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everybody!
Your solution is good, but you're adding too many rows, which is unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
on = c('ID', 'Variable', 'No'), roll = TRUE],
ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
One alternative is to use a custom-built cumulative sum function. This is exactly the method in @David Arenburg's comment, but it substitutes in a custom cumulative sum function.
EDIT: Using @eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z){
Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = TRUE), z, accumulate = TRUE)
}
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[, (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient, but it gets the job done and gives you an admittedly very slow cumulative sum function that handles NAs the way you want.
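If the Reduce-based version is too slow, a fully vectorised variant may be worth trying. The sketch below uses a made-up name, cumsum_na, and assumes the desired rule is the same as above: NA until the first non-NA value has been seen, and NAs after that count as 0.

```r
# Vectorised NA-aware cumulative sum (cumsum_na is a hypothetical helper name)
cumsum_na <- function(z) {
  out <- cumsum(replace(z, is.na(z), 0))   # treat NA as 0 while summing
  out[cumsum(!is.na(z)) == 0] <- NA        # restore NA before the first observed value
  out
}

cumsum_na(c(NA, 2, 1, NA))
cumsum_na(c(NA, NA, NA))
```

It can be dropped into the lapply(.SD, ...) call in place of cumsum.na, since it takes and returns a vector of the same length.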

Keep only 'by' variables when collapsing data.table

I have a very large data.table:
DT <- data.table(a=c(1,1,1,1,2,2,2,2,3,3,3,3),b=c(1,1,2,2),c=1:12)
And I need to collapse it by several variables, e.g. list(a,b). Easy:
DT[,sum(c),by=list(a,b)]
a b V1
1: 1 1 3
2: 1 2 7
3: 2 1 11
4: 2 2 15
5: 3 1 19
6: 3 2 23
However, I don't want to take any operation on c, I just want to drop it:
DT[,,by=list(a,b)] # includes a,b,c, thus does not collapse
DT[,list(),by=list(a,b)] # zero rows
DT[,a,by=list(a,b)] # what I want but adds extraneous column a after 'by' columns
How can I specify X below to get the indicated result?
DT[,X,by=list(a,b)]
a b
1: 1 1
2: 1 2
3: 2 1
4: 2 2
5: 3 1
6: 3 2
unique.data.table has a by argument; you can then subset the result to get the columns you want,
e.g.
unique(DT, by = c('a', 'b'))[, c('a','b')]
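An equivalent way to think about it, sketched here on the question's example data, is to select the 'by' columns first and then deduplicate, so that column c is never carried along:

```r
library(data.table)

DT <- data.table(a = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 b = rep(c(1, 1, 2, 2), 3),
                 c = 1:12)

# Keep only the grouping columns, then drop duplicate rows
res <- unique(DT[, .(a, b)])
res
```

Which of the two forms is faster on a very large table is something to benchmark on your own data.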

data.table assigning with `sapply` in a merge

I have some data.tables like so:
x <- data.table(id=rep(1:3, 2), a=1:6)
y <- data.table(id=1:3, b=2:4)
I can merge them like this:
setkey(x, id)
setkey(y, id)
x[y]
id a b
1: 1 1 2
2: 1 4 2
3: 2 2 3
4: 2 5 3
5: 3 3 4
6: 3 6 4
Now, I want to create a new column in x based off a and b which is the sum of a and b.
I can do this with:
x[y, val:=a + b]
However, now suppose for some reason that the '+' operator is not vectorised. How can I store a row-wise calculation into x where x[y] is needed for the calculation? Also, assume I cannot use mapply (because for my actual problem, mapply is not suited to the function).
I'm trying to use sapply like so to add in a row-wise manner:
x[y, sapply(1:nrow(x), function (i) a[i] + b[i])]
However this returns the incorrect result:
id V1
1: 1 3
2: 1 NA
3: 1 NA
4: 1 NA
5: 1 NA
6: 1 NA
7: 2 5
8: 2 NA
9: 2 NA
10: 2 NA
11: 2 NA
12: 2 NA
13: 3 7
14: 3 NA
15: 3 NA
16: 3 NA
17: 3 NA
18: 3 NA
If I do this it works:
x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]
# [1] 3 6 5 8 7 10
BUT when I try and assign this to a column in x, it is not stored (makes sense because it looks like I'm trying to save the new column into x[y]).
x[y][, val:=sapply(1:nrow(x), function (i) a[i] + b[i])]
Is there any way to do the above but save the output into x[, val]?
Is this how I am supposed to do it, or is there a more data.table-y way?
x[, val:=x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]]
You are doing by-without-by without knowing it (see the description from the help below):
Advanced: Aggregation for a subset of known groups is particularly
efficient when passing those groups in i. When i is a data.table,
DT[i,j] evaluates j for each row of i. We call this by without by or
grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
This means that j is evaluated for each row of i (cycling through y one row at a time), so if you run sapply(1:nrow(x), ...) in j it will create a vector of length nrow(x) each time, which is not what you want.
So your second option is definitely a valid approach (as it is one of the recommended approaches for doing this)
Otherwise you could use .N (When grouping by i, .N is the number of rows in x matched to, for each row of i) not nrow(x), but you will have to think about the length of your objects and how your function is to be vectorized.
Take this as an example
x[y, {browser(); a+b}]
Called from: `[.data.table`(x, y, {
browser()
a + b
})
Browse[1]> a
[1] 1 4
Browse[1]> b
[1] 2
Browse[1]> .N
[1] 2
a has length two, because the value of the key matches two rows from x. b only has length 1 because it only has length 1 in y.
I think the best approach is to correctly vectorize your function (which is hard to give advice on without more of an example).
Another approach would be to replicate b to the length of a, e.g.
x[y, val := {
bl <- rep_len(b, .N)
sapply(seq_len(.N), function(i) a[i] + bl[i])}]
x
id a val
1: 1 1 3
2: 1 4 6
3: 2 2 5
4: 2 5 8
5: 3 3 7
6: 3 6 10
or if you know that y has unique rows for each value of id, then you don't need to try and index any columns from it.
x[y, val2 := sapply(seq_len(.N), function(i) a[i] + b)]
# an alternative would be to use sapply on a (avoid creating another vector)
x[y, val3 := sapply(a, function(ai) ai + b)]
x
# id a val val2 val3
# 1: 1 1 3 3 3
# 2: 1 4 6 6 6
# 3: 2 2 5 5 5
# 4: 2 5 8 8 8
# 5: 3 3 7 7 7
# 6: 3 6 10 10 10

Resources