Recode NA with values from similar ID with data.table - r

I'm in a learning process to use data.table and trying to recode NA to the non-missing values by b.
library(data.table)
dt <- data.table(a = rep(1:3, 2),
b = c(rep(1,3), rep(2, 3)),
c = c(NA, 4, NA, 6, NA, NA))
> dt
a b c
1: 1 1 NA
2: 2 1 4
3: 3 1 NA
4: 1 2 6
5: 2 2 NA
6: 3 2 NA
I would like to get this:
> dt
a b c
1: 1 1 4
2: 2 1 4
3: 3 1 4
4: 1 2 6
5: 2 2 6
6: 3 2 6
I tried these, but none gives the desired result.
dt[, c := ifelse(is.na(c), !is.na(c), c), by = b]
dt[is.na(c), c := dt[!is.na(c), .(c)], by = b]
Appreciate to get some helps and a little bit explanation on how should I consider/think when trying to solve the problem with data.table approach.

Assuming a simple case where there is just one c for each level of b:
dt[, c := c[!is.na(c)][1], by = b]
dt

Related

Skip NAs when using Reduce() in data.table

I'm trying to get the cumulative sum of data.table rows and was able to find this code in another stackoverflow post:
devDF1[,names(devDF1):=Reduce(`+`,devDF1,accumulate=TRUE)]
It does what I need it to do, however when it comes across a row that starts off with an NA, it will just replace every element in that row with NA (instead of the cumsum of the other elements in the row). I don't want to replace the NAs with 0s, because I'll be needing this output for further processes and don't want the same final cumsum duplicated in the rows. Is there any way I can adjust that piece of code to ignore the NAs? Or is there an alternate code that could be used to get the cumulative sum of the rows in a data.table while ignoring NAs?
Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
dt
# a b c
#1: 1 3 NA
#2: 2 NA 1
#3: 3 1 NA
#4: 4 2 3
#5: 5 4 4
If you want to carry previous value to NA values you can use :
dt[, names(dt) := lapply(.SD, function(x) cumsum(replace(x, is.na(x), 0))),
.SDcols = names(dt)]
dt
# a b c
#1: 1 3 0
#2: 3 3 1
#3: 6 4 1
#4: 10 6 4
#5: 15 10 8
If you want to keep NA as NA :
dt[, names(dt) := lapply(.SD, function(x) {
x1 <- cumsum(replace(x, is.na(x), 0))
x1[is.na(x)] <- NA
x1
}), .SDcols = names(dt)]
dt
# a b c
#1: 1 3 NA
#2: 3 NA 1
#3: 6 4 NA
#4: 10 6 4
#5: 15 10 8

Filter duplicate sequence of rows

(Note i was surprised not finding a similar question but i am happy to remove this one if i am mistaken).
I have the following sample dataset.
library(data.table)
dt <- data.table(val = c(1, 2, 3, 0, 2, 4, 1, 2, 3), id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))
Group with id=1 has the same values for val (1,2,3) as group with id=3. I would like to filter out this "duplicate" values in group id=3.
My desired output is:
> dt
val id
1: 1 1
2: 2 1
3: 3 1
4: 0 2
5: 2 2
6: 4 2
I only came up with dirty workarounds like taking the sum: dt[, filter:= sum(val) , by = id] and remove duplicates, but then the values for id = 2 would also disappear.
Note: If values for id=3 would be 1,3,2 (so same values but different order, the rows should not be removed),..so order matters.
This is not a data.table specific approach, but it would work:
x = split(dt$val, dt$id)
dt[!id %in% names(x[duplicated(x)])]
# val id
#1: 1 1
#2: 2 1
#3: 3 1
#4: 0 2
#5: 2 2
#6: 4 2
It might be not optimal in terms of efficiency.
You can convert to string, remove duplicates and merge, i.e.
merge(dt, unique(dt[, .(new = toString(val)), id], by = 'new'))[,new := NULL][]
# id val
#1: 1 1
#2: 1 2
#3: 1 3
#4: 2 0
#5: 2 2
#6: 2 4
We can avoid merge by pulling the ids and using %in%, i.e.
i1 <- unique(dt[, .(new = toString(val)), id], by = 'new')[, id]
dt[id %in% i1,]
# val id
#1: 1 1
#2: 2 1
#3: 3 1
#4: 0 2
#5: 2 2
#6: 4 2
Another option with data.table:
dt <- dt[, pat := paste(val, collapse = "/"), by = id][
, .SD[which.min(rleid(pat))], by = .(pat, val)][, pat := NULL]
Output:
val id
1: 1 1
2: 2 1
3: 3 1
4: 0 2
5: 2 2
6: 4 2

Add max column using variable

I am trying to do the same thing as this question: Add max value to a new column in R, however, I want to pass in a variable instead of the column name directly so I don't hard code the columns name into the formula.
Sample code:
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
d
a b
1 1
1 3
2 5
2 9
3 4
3 NA
I can get this:
a b max_b
1 1 3
1 3 3
2 5 9
2 9 9
3 4 4
3 NA 4
By hard coding it: setDT(d)[, max_b:= max(b, na.rm = T), a] but I would like to do something like this instead:
cn <- "b"
setDT(d)[, paste0("max_", cn):= max(cn, na.rm = T), a]
However, this is not working because inside of max() it evaluates to max of the character instead of the column. And it evaluates to a column named max_b that contains the value b because max("b") = "b". I get why this is happening, I just do not know a workaround.
What is a solution to this?
Note: the above stack question I tagged was marked as a duplicate and closed, but I chose that question because I am using the accepted answer from it in my code. I also do not 100% agree that it is a duplicate question anyways.
Try setDT(d)[, paste0("max_", cn) := eval(parse(text = max(eval(parse(text = cn))))), a]
# output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
# example with missing values
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
cn <- "b"
setDT(d)[, paste0("max_", cn) := eval(parse(text = max(eval(parse(text = cn)),
na.rm = TRUE))), a]
#output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
6: 3 NA 4
One option is to specify the variable in .SDcols and then apply the function on .SD (Subset of Data.table).
d[, paste0("max_", cn) := lapply(.SD, max, na.rm = TRUE), by = a, .SDcols = cn]
d
# a b max_b
#1: 1 1 3
#2: 1 3 3
#3: 2 5 9
#4: 2 9 9
#5: 3 4 4
#6: 3 NA 4
Another option is converting to symbol and then do the evaluation
d[, paste0("max_", cn) := max(eval(as.symbol(cn)), na.rm = TRUE), by = a]

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow It displays just 2 rows and removes other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works like fluid.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.

Reshaping data.table with cumulative sum

I want to reshape a data.table, and include the historic (cumulative summed) information for each variable. The No variable indicates the chronological order of measurements for object ID. At each measurement additional information is found. I want to aggregate the known information at each timestamp No for object ID.
Let me demonstrate with an example:
For the following data.table:
df <- data.table(ID=c(1,1,1,2,2,2,2),
No=c(1,2,3,1,2,3,4),
Variable=c('a','b', 'a', 'c', 'a', 'a', 'b'),
Value=c(2,1,3,3,2,1,5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
So the summed values of Value, per Variable by (ID, No), cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a,b,c as an example, the original file has about 40 different levels. Furthermore, the NAs are important; there are also Value-values of 0, which means something else than NA
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to duplicate each row TotalNo - No times, where TotalNo is the maximum No per ID. Then the original dcast function can be used to extract the dataframe. So in code:
df[,TotalNo := .N, by=ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))] #create duplicates
df3 <- df2[order(ID, No)]#, No:= seq_len(.N), by=.(ID, No)]
df3[,No:= seq(from=No[1], to=TotalNo[1], by=1), by=.(ID, No)]
df4<- dcast(df3,
formula = ID + No ~ Variable,
value.var = "Value", fill=NA, fun.aggregate = sum)
It is not really nice, because the creation of duplicates uses more memory. I think it can be further optimized, but so far it works for my purposes. In the sample code it goes from 7 rows to 16 rows, in the original file from 241,670 rows to a whopping 978,331. That's over a factor 4 larger.
SOLUTION
Eddi has improved my solution in computing time in the full dataset (2.08 seconds of Eddi versus 4.36 seconds of mine). Those are numbers I can work with! Thanks everybody!
Your solution is good, but you're adding too many rows, that are unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
on = c('ID', 'Variable', 'No'), roll = TRUE],
ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
One alternative way to do it is using a custom built cumulative sum function. This is exactly the method in #David Arenburg's comment, but substitutes in a custom cumulative summary function.
EDIT: Using #eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z){
Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = T), z, accumulate = T)
}
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[, (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient, but it gets the job done and gives you an admittedly very slow very slow cumulative summary function that handles NAs the way you want to.

Resources