Skip NAs when using Reduce() in data.table - r

I'm trying to get the cumulative sum of data.table rows and was able to find this code in another stackoverflow post:
devDF1[,names(devDF1):=Reduce(`+`,devDF1,accumulate=TRUE)]
It does what I need, but when it comes across a row that starts with an NA, it replaces every element in that row with NA (instead of the cumsum of the other elements in the row). I don't want to replace the NAs with 0s, because I'll need this output for further processing and don't want the same final cumsum duplicated across rows. Is there any way to adjust that piece of code to ignore the NAs? Or is there alternate code that gets the cumulative sum of the rows of a data.table while ignoring NAs?

Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
dt
# a b c
#1: 1 3 NA
#2: 2 NA 1
#3: 3 1 NA
#4: 4 2 3
#5: 5 4 4
If you want to carry previous value to NA values you can use :
dt[, names(dt) := lapply(.SD, function(x) cumsum(replace(x, is.na(x), 0))),
   .SDcols = names(dt)]
dt
# a b c
#1: 1 3 0
#2: 3 3 1
#3: 6 4 1
#4: 10 6 4
#5: 15 10 8
If you want to keep NA as NA :
dt[, names(dt) := lapply(.SD, function(x) {
x1 <- cumsum(replace(x, is.na(x), 0))
x1[is.na(x)] <- NA
x1
}), .SDcols = names(dt)]
dt
# a b c
#1: 1 3 NA
#2: 3 NA 1
#3: 6 4 NA
#4: 10 6 4
#5: 15 10 8
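For the original row-wise Reduce() call, here is a minimal sketch of the same idea, assuming devDF1 is a data.table of numeric columns and that "ignoring NAs" means treating them as 0 during the addition (the column names below are made up for illustration):

```r
library(data.table)

# toy stand-in for devDF1 (the real column names are unknown)
devDF1 <- data.table(v1 = c(1, 2, NA), v2 = c(3, NA, 4), v3 = c(NA, 1, 2))
na_mask <- devDF1[, lapply(.SD, is.na)]  # remember where the NAs were

# row-wise running sum across columns, treating NA as 0 inside the addition
devDF1[, names(devDF1) := Reduce(`+`,
                                 lapply(.SD, function(x) replace(x, is.na(x), 0)),
                                 accumulate = TRUE)]

# optional: put NA back where the input was NA, as in the second variant above
for (j in names(devDF1)) set(devDF1, which(na_mask[[j]]), j, NA)
devDF1
```

Whether you want the mask-restore step depends on whether downstream code should see the NA pattern or the carried sums.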

Related

Replacing NA values with a specific average

I have a data.frame with several columns. How could I replace each NA value with the average of the first non-NA value before and the first non-NA value after that cell in its column?
For example:
1. 1 2 3
2. 4 NA 7
3. 9 NA 8
4. 1 5 6
I need the first NA to be (5+2)/2 = 3.5
and the second to be (3.5+5)/2 = 4.25
Let's create some sample data and transform it to a data.table:
require(data.table)
require(zoo)
dat <- data.frame(a = c(1, 2, NA, 4))
setDT(dat)
Now, using the zoo::na.approx function we can impute the missing values.
dat[, newA:= na.approx(a, rule = 2)]
Output:
a newA
1: 1 1
2: 2 2
3: NA 3
4: 4 4
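Note that na.approx interpolates linearly, so for c(2, NA, NA, 5) it would return 3 and 4. If you literally want each NA to be the average of the previous (possibly just-filled) value and the first non-NA after it (3.5 and 4.25 in the question's example), a small helper can do it. This is a sketch; fill_avg is a made-up name, and edge cases such as leading or trailing NAs are not handled:

```r
fill_avg <- function(x) {
  for (i in which(is.na(x))) {
    after <- x[seq(i + 1, length(x))]
    # average of the value just before (already filled if it was NA)
    # and the first non-NA value after this cell
    x[i] <- (x[i - 1] + after[!is.na(after)][1]) / 2
  }
  x
}

fill_avg(c(2, NA, NA, 5))
# [1] 2.00 3.50 4.25 5.00
```

Applied column-wise on a data.table it would be something like `dat[, names(dat) := lapply(.SD, fill_avg)]`.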

Recode NA with values from similar ID with data.table

I'm in the process of learning data.table and am trying to recode NA to the non-missing value within each group b.
library(data.table)
dt <- data.table(a = rep(1:3, 2),
b = c(rep(1,3), rep(2, 3)),
c = c(NA, 4, NA, 6, NA, NA))
> dt
a b c
1: 1 1 NA
2: 2 1 4
3: 3 1 NA
4: 1 2 6
5: 2 2 NA
6: 3 2 NA
I would like to get this:
> dt
a b c
1: 1 1 4
2: 2 1 4
3: 3 1 4
4: 1 2 6
5: 2 2 6
6: 3 2 6
I tried these, but none gives the desired result.
dt[, c := ifelse(is.na(c), !is.na(c), c), by = b]
dt[is.na(c), c := dt[!is.na(c), .(c)], by = b]
I'd appreciate some help, and a little explanation of how I should think when trying to solve a problem with the data.table approach.
Assuming a simple case where there is just one distinct non-NA c for each level of b:
dt[, c := c[!is.na(c)][1], by = b]
dt
Here c[!is.na(c)][1] takes the first non-NA value of c within each group of b, and := recycles it across the whole group.
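An alternative sketch uses data.table's nafill (available in data.table >= 1.12.4): carrying the known value forward and then backward within each group also works when the non-NA value is not the first row of the group:

```r
library(data.table)
dt <- data.table(a = rep(1:3, 2),
                 b = c(rep(1, 3), rep(2, 3)),
                 c = c(NA, 4, NA, 6, NA, NA))

# fill forward (locf) then backward (nocb) within each group b
dt[, c := nafill(nafill(c, type = "locf"), type = "nocb"), by = b]
dt$c
# [1] 4 4 4 6 6 6
```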

R previous index per group

I am trying to set the previous observation per group to NA, if a certain condition applies.
Assume I have the following datatable:
DT = data.table(group=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6,6,3,1,1,3,6), a=1:9, b=9:1)
and I am using the simple condition:
DT[y == 6]
How can I set the previous rows of DT[y == 6] within DT to NA, namely the rows with the numbers 2 and 8 of DT? That is, how to set the respectively previous rows per group to NA.
Please note: From DT we can see that there are 3 rows when y is equal to 6, but for group a (row nr 4) I do not want to set the previous row to NA, as the previous row belongs to a different group.
In other words, what I want is the previous index of certain elements in the data.table. Is that possible? It would also be interesting to go back further than one period. Thanks for any hints.
You can find the row indices where current y is not 6 and next row is 6, then set the whole row to NA:
DT[shift(y, type="lead")==6 & y!=6,
(names(DT)) := lapply(.SD, function(x) NA)]
DT
output:
group v y a b
1: b 1 1 1 9
2: <NA> NA NA NA NA
3: b 1 6 3 7
4: a 2 6 4 6
5: a 2 3 5 5
6: a 1 1 6 4
7: c 1 1 7 3
8: <NA> NA NA NA NA
9: c 2 6 9 1
As usual, Frank comments with a more succinct version:
DT[shift(y, type="lead")==6 & y!=6, names(DT) := NA]
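To go back further than one period, as the asker wondered, you can combine several leads. A sketch with a made-up lookback k = 2, which (like the answer above) relies on the y == 6 rows themselves marking the group boundaries rather than grouping explicitly:

```r
library(data.table)
DT <- data.table(group = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = c(1, 3, 6, 6, 3, 1, 1, 3, 6),
                 a = 1:9, b = 9:1)

k <- 2  # how many previous rows to blank out
# shift() with a vector n returns a list of leads; OR them together
hit <- Reduce(`|`, shift(DT$y == 6, n = 1:k, type = "lead", fill = FALSE))
DT[hit & y != 6, names(DT) := NA]
```

With k = 1 this reduces to the answer above.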

Reshaping data.table with cumulative sum

I want to reshape a data.table, and include the historic (cumulative summed) information for each variable. The No variable indicates the chronological order of measurements for object ID. At each measurement additional information is found. I want to aggregate the known information at each timestamp No for object ID.
Let me demonstrate with an example:
For the following data.table:
df <- data.table(ID=c(1,1,1,2,2,2,2),
No=c(1,2,3,1,2,3,4),
Variable=c('a','b', 'a', 'c', 'a', 'a', 'b'),
Value=c(2,1,3,3,2,1,5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
So the summed values of Value, per Variable by (ID, No), cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important; there are also Value entries of 0, which mean something different from NA.
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to duplicate each row TotalNo - No times, where TotalNo is the maximum No per ID. Then the original dcast function can be used to extract the dataframe. So in code:
df[, TotalNo := .N, by = ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))] # create duplicates
df3 <- df2[order(ID, No)]
df3[, No := seq(from = No[1], to = TotalNo[1], by = 1), by = .(ID, No)]
df4 <- dcast(df3,
             formula = ID + No ~ Variable,
             value.var = "Value", fill = NA, fun.aggregate = sum)
It is not really nice, because creating the duplicates uses more memory. I think it can be further optimized, but so far it works for my purposes. In the sample code it goes from 7 rows to 16; in the original file, from 241,670 rows to a whopping 978,331. That's more than a factor of 4 larger.
SOLUTION
Eddi has improved my solution's computing time on the full dataset (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everybody!
Your solution is good, but you're adding too many rows that are unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
on = c('ID', 'Variable', 'No'), roll = TRUE],
ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
One alternative way to do it is using a custom-built cumulative sum function. This is exactly the method in @David Arenburg's comment, but substitutes in a custom cumulative sum function.
EDIT: Using @eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z){
  Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = TRUE), z, accumulate = TRUE)
}
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[, (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient, but it gets the job done and gives you an admittedly very slow cumulative sum function that handles NAs the way you want.
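Since the Reduce-based cumsum.na walks the vector element by element, a vectorized sketch may be worth trying; on this data it gives the same result (NA until the first non-NA, then a running sum that skips NAs). cumsum_skipna is a made-up name:

```r
cumsum_skipna <- function(x) {
  out <- cumsum(replace(x, is.na(x), 0))  # running sum with NAs treated as 0
  out[cumsum(!is.na(x)) == 0] <- NA       # stay NA until the first real value
  out
}

cumsum_skipna(c(NA, 2, 1, NA))
# [1] NA  2  3  3
```

It can be dropped into the same lapply(.SD, ...) call in place of cumsum.na.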

Maintain NA's after aggregation R

I have a data frame as follows
test_df<-data.frame(col1=c(1,NA,NA,4,5),col2=c(3,NA,NA,5,6),col3=c("a","b","c","d","c"))
test_df
col1 col2 col3
1 3 a
NA NA b
NA NA c
4 5 d
5 6 c
I am aggregating data based on col3
agg_test<-aggregate(list(test_df$col1,test_df$col2),by=list(test_df$col3),sum,na.rm=T)
agg_test
Col3 col1 col2
a 1 3
b 0 0
c 5 6
d 4 5
From what I know, for the summation to be correct we need to state explicitly what is to be done with NAs; in this case I have specified that NAs are to be removed from the summation. I guess that for group b, once the NAs are removed, R is left with nothing to sum and so returns 0. I need to treat the NAs and 0s in my data differently, and therefore have to maintain the NAs that are valid (in this case the observations for b are NA, not 0). How can I achieve this?
Expected output:
Col3 col1 col2
a 1 3
b NA NA
c 5 6
d 4 5
library(data.table)
unique(setDT(test_df)[, lapply(.SD, function(x)
replace(x, !all(is.na(x)), sum(x, na.rm=TRUE))) , by=col3])
# col3 col1 col2
#1: a 1 3
#2: b NA NA
#3: c 5 6
#4: d 4 5
test_df1 <- test_df
test_df1$col2[2] <- 2
unique(setDT(test_df1)[, lapply(.SD, function(x)
replace(x, !all(is.na(x)), sum(x, na.rm=TRUE))) , by=col3])
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 6
#4: d 4 5
Update
Or using the compact code suggested by #Arun
test_df1$col2[5] <- NA
setDT(test_df1)[, lapply(.SD,
function(x) sum(x,na.rm= !all(is.na(x)))), by=col3]
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 NA
#4: d 4 5
It sounds like (based on your comments to requests for clarification) you want to aggregate your groups so that you get NA if all the values are missing, and otherwise the sum of the non-missing values. You can pass aggregate a user-defined function with this behavior:
aggregate(list(test_df$col1,test_df$col2), by=list(test_df$col3),
function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm=T)))
# Group.1 c.1..NA..NA..4..5. c.3..NA..NA..5..6.
# 1 a 1 3
# 2 b NA NA
# 3 c 5 6
# 4 d 4 5
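The unnamed-list call is what produces the ugly `c.1..NA..NA..4..5.` column names above. A sketch of the same idea with the formula interface, which keeps the original names; na.action = na.pass is needed because the formula method drops NA rows by default:

```r
test_df <- data.frame(col1 = c(1, NA, NA, 4, 5),
                      col2 = c(3, NA, NA, 5, 6),
                      col3 = c("a", "b", "c", "d", "c"))

# NA if the whole group is missing, otherwise the sum of non-missing values
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)

agg <- aggregate(cbind(col1, col2) ~ col3, data = test_df,
                 FUN = sum_or_na, na.action = na.pass)
agg
#   col3 col1 col2
# 1    a    1    3
# 2    b   NA   NA
# 3    c    5    6
# 4    d    4    5
```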
