Maximum value of one data.table column based on other columns - r

I have an R data.table:
DT = data.table(x=rep(c("b","a",NA_character_),each=3), y=rep(c('A', NA_character_, 'C'), each=3), z=c(NA_character_), v=1:9)
DT
# x y z v
#1: b A NA 1
#2: b A NA 2
#3: b A NA 3
#4: a NA NA 4
#5: a NA NA 5
#6: a NA NA 6
#7: NA C NA 7
#8: NA C NA 8
#9: NA C NA 9
For each column, I want to extract the max value of v over the rows where that column is not NA. I am using
sapply(DT, function(x) { ifelse(all(is.na(x)), NA_integer_, max(DT[['v']][!is.na(x)])) })
#x y z v
#6 9 NA 9
Is there a simpler way to achieve this?

Here is a way, giving you -Inf (and a warning) if all values of the column are NA (you can later replace that by NA if you prefer; see the sketch after the output below):
DT[, lapply(.SD, function(x) max(v[!is.na(x)]))]
# x y z v
# 1: 6 9 -Inf 9
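For instance, one post-hoc way to swap those -Inf sentinels for NA (a small sketch, not from the original answer; suppressWarnings() could wrap the first call if the warning bothers you):
res <- DT[, lapply(.SD, function(x) max(v[!is.na(x)]))]
# replace the infinite sentinels that max() returns for empty input
res[, names(res) := lapply(.SD, function(x) replace(x, is.infinite(x), NA))]
res
#    x y  z v
# 1: 6 9 NA 9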
As suggested by @DavidArenburg, to ensure that everything goes well even when all values are NA (no warning and directly NA as result), you can do:
DT[, lapply(.SD, function(x) {
  temp <- v[!is.na(x)]
  if(!length(temp)) NA else max(temp)
})]
# x y z v
#1: 6 9 NA 9

We can use summarise_each from dplyr
library(dplyr)
DT %>%
summarise_each(funs(max(v[!is.na(.)])))
# x y z v
#1: 6 9 -Inf 9
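Note that summarise_each() has since been deprecated in dplyr; assuming dplyr >= 1.0 is available, the across() idiom below should be equivalent (it still yields -Inf for the all-NA column z):
DT %>%
  summarise(across(everything(), ~ max(v[!is.na(.x)])))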

Related

rollmean with grouped data.table returns a logical

I am trying to use rollmean from the package zoo in a data.table while grouping data.
It works fine when all groups have enough data:
library(data.table)
dt = data.table(x=rep(c("a","b"),10),y=rnorm(20))
dt[,.(ma=rollmean(y, k = 7, fill=NA,align="right")), by = .(x)]
But when one of the groups has too little data, it returns an error
dt2 = data.table(x=rep(c("c"),1),y=rnorm(1))
dt3=rbind(dt,dt2)
dt3[,.(ma=rollmean(y, k = 7, fill=NA,align="right")), by = .(x)]
Here's the error message:
Column 1 of result for group 3 is type 'logical' but expecting type 'double'. Column types must be consistent for each group.
It seems to happen because rollmean returns a logical vector (nothing but NA, and a plain NA is logical) when it doesn't have enough data.
Given that my data is always positive, I use the following trick to make my code run anyway:
dt4=dt3[,.(ma=rollmean(y, k = 7, fill=-1,align="right")), by = .(x)]
dt4[ma==-1,ma:=NA]
dt4
Is there a proper/better way to do it?
We can use NA_real_ instead of NA, because a plain NA is of type logical:
dt3[x == 'c', class(rollmean(y, k = 7, fill = NA, align = 'right'))]
#[1] "logical"
With NA_real_ as fill, it works fine:
dt3[,.(ma=rollmean(y, k = 7, fill=NA_real_,align="right")), by = .(x)]
# x ma
# 1: a NA
# 2: a NA
# 3: a NA
# 4: a NA
# 5: a NA
# 6: a NA
# 7: a 0.19653855
# 8: a -0.05506344
# 9: a -0.17022022
#10: a -0.28731762
#11: b NA
#12: b NA
#13: b NA
#14: b NA
#15: b NA
#16: b NA
#17: b 0.02117906
#18: b -0.07079598
#19: b -0.05393943
#20: b 0.04511924
#21: c NA
#     x          ma
The fill creates NA in the other groups too, but there it gets coerced to numeric NA because those groups also contain non-NA numeric elements.
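As an aside, data.table's own frollmean() (available in data.table >= 1.12.0) always returns a double vector, even for groups shorter than the window, so it sidesteps the type issue entirely:
dt3[, .(ma = frollmean(y, n = 7, align = "right")), by = .(x)]
# group 'c' is shorter than the window, so its single row comes out as NA_real_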

Calculate cummean() and cumsd() while ignoring NA values and filling NAs

My goal is to obtain the cumulative mean (and cumulative sd) of a data frame while ignoring NAs and filling them with the previous cumulative means:
df:
var1 var2 var3
x1 y1 z1
x2 y2 z2
NA NA NA
x3 y3 z3
cummean:
var1 var2 var3
x1/1 y1/1 z1/1
(x1+x2)/2 (y1+y2)/2 (z1+z2)/2
(x1+x2)/2 (y1+y2)/2 (z1+z2)/2
(x1+x2+x3)/3 (y1+y2+y3)/3 (z1+z2+z3)/3
So for row 3, where df has NA, I want the new matrix to contain the cumulative mean from the line above (the numerator should not increase).
So far, I am using this to compute the cumulative mean (I am aware that somewhere a baby seal gets killed because I used a for loop and not something from the apply family):
for(i in names(df)){
  df[i][!is.na(df[i])] <- GMCM:::cummean(df[i][!is.na(df[i])])
}
I have also tried this:
setDT(posRegimeReturns)
cols <- colnames(posRegimeReturns)
posRegimeReturns[, (cols) := lapply(.SD, cummean), .SDcols = cols]
But both of those leave the NAs unfilled.
Note: this question is similar to this post Calculate cumsum() while ignoring NA values
but unlike the solution there, I don't want to keep the NAs; rather, I want to fill them with the last non-NA cumulative value from above.
You might want to use the definition of variance to calculate this
library(data.table)
dt <- data.table(V1=c(1,2,NA,3), V2=c(1,2,NA,3), V3=c(1,2,NA,3))
cols <- copy(names(dt))
#means
dt[, paste0("mean_", cols) := lapply(.SD, function(x) {
  # number of non-NA observations seen so far
  lens <- cumsum(!is.na(x))
  # set NA to 0 before taking the cumulative sum
  x[is.na(x)] <- 0
  cumsum(x) / lens
}), .SDcols = cols]
#sd
dt[, paste0("sd_", cols) := lapply(.SD, function(x) {
  lens <- cumsum(!is.na(x))
  x[is.na(x)] <- 0
  # unbiased variance: n/(n-1) * (mean of squares minus square of the mean)
  sqrt(lens/(lens-1) * (cumsum(x^2)/lens - (cumsum(x)/lens)^2))
}), .SDcols = cols]
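As a quick hand-computed check on the example dt above (all three columns are identical, so only V1 is shown), the NA row simply repeats the previous cumulative statistic, as requested:
dt[, .(V1, mean_V1, sd_V1)]
#    V1 mean_V1     sd_V1
# 1:  1     1.0       NaN
# 2:  2     1.5 0.7071068
# 3: NA     1.5 0.7071068
# 4:  3     2.0 1.0000000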
Using data.table. In particular:
library(data.table)
N <- 10
DT <- data.table(z = sample(N), idx = 1:N, key = "idx")
z idx
1: 4 1
2: 10 2
3: 9 3
4: 6 4
5: 1 5
6: 8 6
7: 3 7
8: 7 8
9: 5 9
10: 2 10
We now make use of sapply and data.table. Note that this recomputes the mean from scratch at each position, so it is quadratic in the number of rows:
DT[,cummean:=sapply(seq(from=1,to=nrow(DT)) ,function(iii) mean(DT$z[1:iii],na.rm = TRUE))]
DT[,cumsd:=sapply(seq(from=1,to=nrow(DT)) ,function(iii) sd(DT$z[1:iii],na.rm = TRUE))]
resulting in:
z idx cummean cumsd
1: 4 1 4.000000 NA
2: 10 2 7.000000 4.242641
3: 9 3 7.666667 3.214550
4: 6 4 7.250000 2.753785
5: 1 5 6.000000 3.674235
6: 8 6 6.333333 3.386247
7: 3 7 5.857143 3.338092
8: 7 8 6.000000 3.116775
9: 5 9 5.888889 2.934469
10: 2 10 5.500000 3.027650

Merge 2 columns in R

I have a data set with columns I'd like to merge, similar to this:
library(data.table)
DF <- as.data.table(list(ID = c(1,2,3,4,5), Product = c('Y', NA, NA, 'Z', NA), Type = c(NA, 'D', 'G', NA, NA)))
DF
ID Product Type
1 Y NA
2 NA D
3 NA G
4 Z NA
5 NA NA
which I would like to look like this:
DF
ID Product Type Category
1 Y NA Y
2 NA D D
3 NA G G
4 Z NA Z
5 NA NA NA
My code is:
DF[,Category := na.omit(c(Product,Type)), by = ID][,c("Product","Type"):=NULL]
The problem is that I would like Category to be NA when both Product and Type are NA. Also, I don't know how well my code scales, because my data set has over 200,000 rows.
DF[ , Category := ifelse(is.na(Product), Type, Product)]
# ID Product Type Category
#1: 1 Y NA Y
#2: 2 NA D D
#3: 3 NA G G
#4: 4 Z NA Z
#5: 5 NA NA NA
This assumes that when there are values for both Product and Type, you want Product in Category.
We can do this in two assignments and avoid ifelse, as assignment in place (:=) is faster and more efficient.
DF[, Category := Product][is.na(Product), Category := Type][]
# ID Product Type Category
#1: 1 Y NA Y
#2: 2 NA D D
#3: 3 NA G G
#4: 4 Z NA Z
#5: 5 NA NA NA
Or, if we assume there is at most one non-NA value per row across Product/Type, pmax can be used.
DF[, Category := pmax(Product, Type, na.rm = TRUE)][]
# ID Product Type Category
#1: 1 Y NA Y
#2: 2 NA D D
#3: 3 NA G G
#4: 4 Z NA Z
#5: 5 NA NA NA
Benchmarks
DF1 <- DF[rep(1:nrow(DF), 1e6)]
DF2 <- copy(DF1)
DF3 <- copy(DF1)
system.time(DF1[, Category := Product][is.na(Product), Category := Type])
# user system elapsed
# 0.16 0.06 0.17
system.time(DF2[ , Category := ifelse(is.na(Product), Type, Product)])
# user system elapsed
# 1.35 0.19 1.53
system.time(DF3[ ,Category := pmax(Product, Type, na.rm = TRUE)])
# user system elapsed
# 0.04 0.02 0.06
EDIT: Updated with the benchmarks; they clearly show that both methods mentioned in my post are efficient.
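As an aside, newer data.table releases (>= 1.12.4; an assumption about your installed version) ship fcoalesce(), which expresses the same idea directly:
DF[, Category := fcoalesce(Product, Type)][]
# same result as the two-assignment and pmax versions above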

Remove leading NAs to align data

I have a large data.frame with 'staggered' data and would like to align it. What I mean is I would like to take a data.frame like the one built in the code below and remove the leading (top) NAs from all columns, shifting each column up so that every column starts with data at the top.
I know about the na.trim function from the zoo package, but this didn't work on either the initial data.frame presented above or its transpose. For this I used, with the transposed data.frame t.df,
t.df <- na.trim(t.df, sides = 'left')
This only returned an empty data.frame, and wouldn't work the way I wanted anyway since it would create vectors of different lengths. Can anyone point me to a package or function that might be more helpful?
Here is the code for my example used above:
# example of what I have
var1 <- c(1,2,3,4,5,6,7,8,9,10)
var2 <- c(6,2,4,7,3,NA,NA,NA,NA,NA)
var3 <- c(NA,NA,8,6,3,7,NA,NA,NA,NA)
var4 <- c(NA,NA,NA,NA,5,NA,2,6,2,9)
df <- data.frame(var1, var2, var3, var4)
# transpose and (unsuccessful) attempt to remove leading NAs
t.df <- t(df)
t.df <- na.trim(t.df, sides = 'left')
We can loop over the columns with lapply() and apply na.trim(). Then, pad NAs at the end of each list element by assigning its length to the maximum length among the list elements.
library(zoo)
lst <- lapply(df, na.trim)
df[] <- lapply(lst, `length<-`, max(lengths(lst)))
df
# var1 var2 var3 var4
#1 1 6 8 5
#2 2 2 6 NA
#3 3 4 3 2
#4 4 7 7 6
#5 5 3 NA 2
#6 6 NA NA 9
#7 7 NA NA NA
#8 8 NA NA NA
#9 9 NA NA NA
#10 10 NA NA NA
Or, as @G.Grothendieck mentioned in the comments (merge.zoo merges the series by index, padding the shorter ones with NA):
replace(df, TRUE, do.call("merge", lapply(lst, zoo)))
You can do with base functions:
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$values[1]) return(x)
  x[c((r$lengths[1]+1):length(x), 1:r$lengths[1])]
}
df[,] <- lapply(df, my.na.trim)
df
# var1 var2 var3 var4
# 1 1 6 8 5
# 2 2 2 6 NA
# 3 3 4 3 2
# 4 4 7 7 6
# 5 5 3 NA 2
# 6 6 NA NA 9
# 7 7 NA NA NA
# 8 8 NA NA NA
# 9 9 NA NA NA
# 10 10 NA NA NA
An alternative coding for the function:
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$values[1]) return(x)
  r1 <- r$lengths[1]
  c(tail(x, -r1), head(x, r1))
}
We can use the cbind.na() function from the qpcR package and combine it with the na.trim() function from the zoo package:
do.call(qpcR:::cbind.na, lapply(df, zoo::na.trim))
# var1 var2 var3 var4
# [1,] 1 6 8 5
# [2,] 2 2 6 NA
# [3,] 3 4 3 2
# [4,] 4 7 7 6
# [5,] 5 3 NA 2
# [6,] 6 NA NA 9
# [7,] 7 NA NA NA
# [8,] 8 NA NA NA
# [9,] 9 NA NA NA
#[10,] 10 NA NA NA
If speed matters, you can use this data.table solution.
library(data.table)
dt_foo <- function(dt) {
  # number of leading NAs in each column
  shift_v <- sapply(dt, function(col) min(which(!is.na(col))) - 1)
  # build a call that shifts each column up by its own lead count
  shift_expr <- parse(text = paste0("list(", paste("shift(", names(shift_v), ", n = ", shift_v, ", type = 'lead')", collapse = ", "), ")"))
  dt[, (names(shift_v)) := eval(shift_expr)]
  dt[]
}
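As a quick usage sketch (assuming the df from the question), converting it to a data.table and running dt_foo reproduces the aligned table from the answers above:
dt_small <- as.data.table(df)
dt_foo(dt_small)
# same top-aligned columns as the zoo-based answers above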
Some benchmarking follows.
library(zoo)
library(microbenchmark)
set.seed(1)
DT <- as.data.table(matrix(sample(c(0:9L, NA), 1e8, T, prob = c(rep(.01, 10), .9)), ncol = 1000))
zoo_foo <- function(df) {
  lst <- lapply(df, na.trim)
  df[] <- lapply(lst, `length<-`, max(lengths(lst)))
  df
}
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$values[1]) return(x)
  x[c((r$lengths[1]+1):length(x), 1:r$lengths[1])]
}
microbenchmark(dt_foo(copy(DT)), zoo_foo(DT),
               as.data.frame(lapply(DT, my.na.trim)), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
dt_foo(copy(DT)) 1.468749 1.618289 1.690293 1.699926 1.725534 1.893018 10 a
zoo_foo(DT) 6.493227 6.516247 6.834768 6.779045 7.190705 7.319058 10 c
as.data.frame(lapply(DT, my.na.trim)) 4.988514 5.013340 5.384399 5.385273 5.508889 6.517748 10 b

Calculate cumsum() while ignoring NA values

Consider the following named vector x.
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
# a b c d e f g h
# 1 2 0 NA 4 NA NA 6
I'd like to calculate the cumulative sum of x while ignoring the NA values. Many R functions have an argument na.rm which removes NA elements prior to calculations. cumsum() is not one of them, which makes this operation a bit tricky.
I can do it this way.
y <- setNames(numeric(length(x)), names(x))
z <- cumsum(na.omit(x))
y[names(y) %in% names(z)] <- z
y[!names(y) %in% names(z)] <- x[is.na(x)]
y
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
But this seems excessive, and makes a lot of new assignments/copies. I'm sure there's a better way.
What better methods are there to return the cumulative sum while effectively ignoring NA values?
You can do this in one line with:
cumsum(ifelse(is.na(x), 0, x)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
The x*0 term is 0 wherever x is known and NA wherever x is missing, so it reinstates the NAs in the result. Or, similarly:
library(dplyr)
cumsum(coalesce(x, 0)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
It's an old question, but tidyr gives a new solution, based on the idea of replacing NA with zero. (Note that, unlike the answers above, this carries the running total through the NA positions instead of restoring NA; see the sketch after the output below for getting the NAs back.)
require(tidyr)
cumsum(replace_na(x, 0))
a b c d e f g h
1 3 3 3 7 7 7 13
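If you do want the NAs back, the x*0 trick from the first answer combines with replace_na() (a small sketch):
cumsum(replace_na(x, 0)) + x*0
# a  b  c  d  e  f  g  h
# 1  3  3 NA  7 NA NA 13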
Do you want something like this:
x2 <- x
x2[!is.na(x)] <- cumsum(x2[!is.na(x)])
x2
[edit] Alternatively, as suggested by a comment above, you can change the NAs to 0s:
miss <- is.na(x)
x[miss] <- 0
cs <- cumsum(x)
cs[miss] <- NA
# cs is the requested cumsum
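For the example x above, this yields the same result as the other NA-preserving answers:
cs
# a  b  c  d  e  f  g  h
# 1  3  3 NA  7 NA NA 13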
Here's a function I came up with from the answers to this question. Thought I'd share it, since it seems to work well so far. It calculates the cumulative FUNC of x while ignoring NA. FUNC can be any one of sum, prod, min, or max, and x is a numeric vector.
cumSkipNA <- function(x, FUNC)
{
  d <- deparse(substitute(FUNC))
  funs <- c("max", "min", "prod", "sum")
  stopifnot(is.vector(x), is.numeric(x), d %in% funs)
  FUNC <- match.fun(paste0("cum", d))
  x[!is.na(x)] <- FUNC(x[!is.na(x)])
  x
}
set.seed(1)
x <- sample(15, 10, TRUE)
x[c(2,7,5)] <- NA
x
# [1] 4 NA 9 14 NA 14 NA 10 10 1
cumSkipNA(x, sum)
# [1] 4 NA 13 27 NA 41 NA 51 61 62
cumSkipNA(x, prod)
# [1] 4 NA 36 504 NA 7056 NA
# [8] 70560 705600 705600
cumSkipNA(x, min)
# [1] 4 NA 4 4 NA 4 NA 4 4 1
cumSkipNA(x, max)
# [1] 4 NA 9 14 NA 14 NA 14 14 14
Definitely nothing new, but maybe useful to someone.
Another option is using the collapse package with the fcumsum function, like this:
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
#> a b c d e f g h
#> 1 2 0 NA 4 NA NA 6
library(collapse)
fcumsum(x)
#> a b c d e f g h
#> 1 3 3 NA 7 NA NA 13
Created on 2022-08-24 with reprex v2.0.2
