I have a data.table with an amount column, like:
n = 1e5
set.seed(1)
dt <- data.table(id = 1:n, amount = pmax(0,rnorm(n, mean = 5e3, sd = 1e4)))
And a vector of breaks given like:
breaks <- as.vector( c(0, t(sapply(c(1, 2.5, 5, 7.5), function(x) x * 10^(1:4))) ) )
For each interval defined by these breaks, I want to use data.table syntax to:
1. get counts of amount contained in the interval
2. get counts of amount equal to or greater than the left bound (essentially n * (1 - cdf(left bound)))
For 1, this mostly works, but doesn't return rows for the empty intervals:
dt[, .N, keyby = breaks[findInterval(amount,breaks)] ] #would prefer to get 0 for empty intvl
For 2, I tried:
dt[, sum(amount >= breaks[.GRP]), keyby = breaks[findInterval(amount, breaks)]]
but it didn't work, because the sum is evaluated only over the rows within the group, not over the whole table. So I came up with a workaround, which also returns the empty intervals:
dt[, cbind(breaks, sapply(breaks, function(x) sum(amount >= x)))] # desired result
So, what's the data.table way to fix my 2. and to get the empty intervals for both?
I would consider doing this:
mybreaks = c(-Inf, breaks, Inf)
dt[, g := cut(amount, mybreaks)]
dt[.(g = levels(g)), .N, on="g", by=.EACHI]
g N
1: (-Inf,0] 30976
2: (0,10] 23
3: (10,25] 62
4: (25,50] 73
5: (50,75] 85
6: (75,100] 88
7: (100,250] 503
8: (250,500] 859
9: (500,750] 916
10: (750,1e+03] 912
11: (1e+03,2.5e+03] 5593
12: (2.5e+03,5e+03] 9884
13: (5e+03,7.5e+03] 9767
14: (7.5e+03,1e+04] 9474
15: (1e+04,2.5e+04] 28434
16: (2.5e+04,5e+04] 2351
17: (5e+04,7.5e+04] 0
18: (7.5e+04, Inf] 0
You can use cumsum if you want the CDF.
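A minimal sketch (the res and n_ge names are my own) answering 2. with a reverse cumulative sum over that same result, assuming the ascending level order that cut() produces:
res <- dt[.(g = levels(g)), .N, on = "g", by = .EACHI]
res[, n_ge := rev(cumsum(rev(N)))]  # count of amount >= left bound of each interval
# caveat: with right-closed intervals, values tied exactly at a break (here the mass
# at 0) fall in the interval below, so at a tie this counts amount > left bound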
Assuming I have a data.table as below
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), v = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6), a = 1:9, b = 9:1)
> DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
I have a variable sl <- c("a","b") that selects the columns to compute rowSums over. If I try the code below
DT[, ab := rowSums(.SD[, ..sl])]
I still get the desired output, but with a warning message:
DT[, ab := rowSums(.SD[, ..sl])]
Warning message:
In `[.data.table`(.SD, , ..sl) :
Both 'sl' and '..sl' exist in calling scope. Please remove the '..sl' variable in calling scope for clarity.
However, no warning occurs when running
DT[, ab := rowSums(.SD[, sl, with = FALSE])]
I am wondering how to fix the warning issue when using .SD[, ..sl]. Thanks in advance!
The syntax to use is either to specify .SDcols and operate on .SD, or to take the ..cols directly from the original object. According to ?data.table:
x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols]
If we check the source code of data.table, line 248 seems to be the one triggering the warning, since all of the following are TRUE:
DT[, exists(..sl, where = DT)]
#[1] TRUE
and
DT[, .SD[, exists(..sl)]]
#[1] TRUE
DT[, .SD[, exists(..sl, where = .SD)]]
#[1] TRUE
My goal is to obtain the cumulative mean (and cumulative sd) of a data.frame while ignoring NAs and filling those with the previous cumulative means:
df:
var1 var2 var3
x1 y1 z1
x2 y2 z2
NA NA NA
x3 y3 z3
cummean:
var1 var2 var3
x1/1 y1/1 z1/1
(x1+x2)/2 (y1+y2)/2 (z1+z2)/2
(x1+x2)/2 (y1+y2)/2 (z1+z2)/2
(x1+x2+x3)/3 (y1+y2+y3)/3 (z1+z2+z3)/3
So for row 3, where df has NA, I want the new matrix to contain the cumulative mean from the line above (the numerator should not increase).
So far, I am using this to compute the cumulative mean (I am aware that somewhere a baby seal gets killed because I used a for loop and not something from the apply family):
for (i in names(df)) {
  df[[i]][!is.na(df[[i]])] <- GMCM:::cummean(df[[i]][!is.na(df[[i]])])
}
I have also tried this:
setDT(posRegimeReturns)  # my actual data
cols <- colnames(posRegimeReturns)
posRegimeReturns[, (cols) := lapply(.SD, cummean), .SDcols = cols]
But both of those leave the NAs as NA.
Note: this question is similar to this post Calculate cumsum() while ignoring NA values
but unlike the solution there, I don't want to leave the NAs but rather fill those with the same values as the last row above that was not NA.
You might want to use the definition of variance to calculate this
library(data.table)
dt <- data.table(V1=c(1,2,NA,3), V2=c(1,2,NA,3), V3=c(1,2,NA,3))
cols <- copy(names(dt))
# means
dt[, paste0("mean_", cols) := lapply(.SD, function(x) {
  # get the number of non-NA observations
  lens <- cumsum(!is.na(x))
  # set NA to 0 before doing the cumulative sum
  x[is.na(x)] <- 0
  cumsum(x) / lens
}), .SDcols = cols]
# sd
dt[, paste0("sd_", cols) := lapply(.SD, function(x) {
  lens <- cumsum(!is.na(x))
  x[is.na(x)] <- 0
  # definition of variance: mean of squares minus square of the mean,
  # scaled by lens/(lens-1) for the n-1 denominator
  sqrt(lens / (lens - 1) * (cumsum(x^2) / lens - (cumsum(x) / lens)^2))
}), .SDcols = cols]
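Working the toy dt above through by hand, each column should come out as below (row 3 repeats row 2, since neither cumsum() nor lens advances on an NA row):
dt[, .(V1, mean_V1, sd_V1)]
#    V1 mean_V1     sd_V1
# 1:  1     1.0       NaN
# 2:  2     1.5 0.7071068
# 3: NA     1.5 0.7071068
# 4:  3     2.0 1.0000000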
Using data.table. In particular:
library(data.table)
N <- 10  # N was not defined in the original snippet; it is needed for this to run
DT <- data.table(z = sample(N), idx = 1:N, key = "idx")
Since sample() is not seeded, z is a random permutation and your values will differ:
z idx
1: 4 1
2: 10 2
3: 9 3
4: 6 4
5: 1 5
6: 8 6
7: 3 7
8: 7 8
9: 5 9
10: 2 10
We now make use of sapply together with data.table. Because of na.rm = TRUE, each prefix mean/sd simply skips NAs, so an NA row repeats the value from the row above, as requested; note, though, that this rescans the prefix for every row (O(n^2)).
DT[, cummean := sapply(seq_len(nrow(DT)), function(i) mean(DT$z[1:i], na.rm = TRUE))]
DT[, cumsd := sapply(seq_len(nrow(DT)), function(i) sd(DT$z[1:i], na.rm = TRUE))]
resulting in:
z idx cummean cumsd
1: 4 1 4.000000 NA
2: 10 2 7.000000 4.242641
3: 9 3 7.666667 3.214550
4: 6 4 7.250000 2.753785
5: 1 5 6.000000 3.674235
6: 8 6 6.333333 3.386247
7: 3 7 5.857143 3.338092
8: 7 8 6.000000 3.116775
9: 5 9 5.888889 2.934469
10: 2 10 5.500000 3.027650
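Note that the sample above happens to contain no NAs, so na.rm = TRUE is never exercised there. A quick check (my own addition, reusing the same DT) that an NA row repeats the previous cumulative mean:
DT$z[3] <- NA
DT[, cummean := sapply(seq_len(.N), function(i) mean(z[1:i], na.rm = TRUE))]
# row 3 now shows 7.000000, the same as row 2, because the NA is skipped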
This question already has answers here: Fastest way to replace NAs in a large data.table (10 answers). Closed 5 years ago.
A lot comes together in this question. First of all, I would like to segment the data by column c. The segments are given by the factor c, whose levels are 1 to 4: so four distinct segments.
Next, I have two columns, a and b.
I would like to replace the NAs with the maximum value of each segment-specific column. So, for example, the NA at row 2 of column a would become 30; (b, 3) would be 80, (b, 8) would be 50 and (a, 5) would be 80.
I have created the code below that does the job, but now I need to make it automatic (like a for loop) for all segments and columns. How could I do this?
a <- c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40)
b <- c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90)
c <- c(1,1,1,2,2,2,2,2,3,3,3,4,4,4)
a b c
1: 10 80 1
2: NA 70 1
3: 30 NA 1
4: 40 50 2
5: NA 40 2
6: 60 30 2
7: 70 20 2
8: 80 NA 2
9: 90 0 3
10: 90 0 3
11: 80 10 3
12: 90 69 4
13: 10 40 4
14: 40 90 4
mytable <- data.table(a,b,c)
mytable[which(is.na(mytable[c == 1][,1, with = FALSE]) == TRUE),1] <- max(mytable[c==1,1], na.rm = TRUE)
Unfortunately, this try results in an error:
for(i in unique(mytable$c)){
for(j in unique(c(1:2))){
mytable[which(is.na(mytable[c == i][,j, with = FALSE]) == TRUE),j, with = FALSE] <- max(mytable[c==i][,j, with = FALSE], na.rm = TRUE)
}
}
Error in `[<-.data.table`(`*tmp*`, which(is.na(mytable[c == i][, j, with = FALSE]) == :
  unused argument (with = FALSE)
Surprisingly, this results in an error as well:
for(i in unique(mytable$c)){
for(j in unique(c(1:2))){
mytable[which(is.na(mytable[c == i][,j]) == TRUE),j] <- max(mytable[c==i,j], na.rm = TRUE)
}
}
Error in `[.data.table`(mytable, c == i, j) :
  j (the 2nd argument inside [...]) is a single symbol but column name 'j' is not found. Perhaps you intended DT[,..j] or DT[,j,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.
library("data.table")
mytable <- data.table(
a=c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40),
b=c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90),
c=c(1,1,1,2,2,2,2,2,3,3,3,4,4,4))
foo <- function(x) { x[is.na(x)] <- max(x, na.rm = TRUE); x }  # replace NAs with the group's max
mytable[, .(A=foo(a), B=foo(b)), by=c]
result:
> mytable[, .(A=foo(a), B=foo(b)), by=c]
# c A B
# 1: 1 10 80
# 2: 1 30 70
# 3: 1 30 80
# 4: 2 40 50
# 5: 2 80 40
# 6: 2 60 30
# 7: 2 70 20
# 8: 2 80 50
# 9: 3 90 0
#10: 3 90 0
#11: 3 80 10
#12: 4 90 69
#13: 4 10 40
#14: 4 40 90
or for direct substitution of a and b:
mytable[, `:=`(a=foo(a), b=foo(b)), by=c] # or
mytable[, c("a", "b") := (lapply(.SD, foo)), by = c] # from #Sotos
or the safer variant (thanks to @Frank for the remark):
cols <- c("a", "b")
mytable[, (cols) := lapply(.SD, foo), by=c, .SDcols=cols]
Using data.table
library(data.table)
mytable[, a := ifelse(is.na(a), max(a, na.rm = TRUE), a), by = c]
mytable[, b := ifelse(is.na(b), max(b, na.rm = TRUE), b), by = c]
Or in a single command
mytable[, c("a", "b") := lapply(.SD, function(x) ifelse(is.na(x), max(x, na.rm = TRUE), x)), .SDcols = c("a", "b"), by = c]
Use ddply() from package plyr:
df<-data.frame(a,b,c=as.factor(c))
library(plyr)
df2<-ddply(df, .(c), transform, a=ifelse(is.na(a), max(a, na.rm=T),a),
b=ifelse(is.na(b), max(b, na.rm=T),b))
I have a data.table that looks like this:
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to calculate a new column G whose first value is the average of the first 20 rows of column B, and where each subsequent row of G is computed from the row before it.
Say the average of the first 20 rows of column B is 105. The formula for the next row in G is DT$G[2] = DT$G[1] * 2, and the next again is DT$G[3] = DT$G[2] * 2; that is, each row doubles the previous value of G rather than reusing the first value.
A B C G
1: 1 10 100 105
2: 2 20 200 210
3: 3 30 300 420
4: 4 40 400 840
5: 5 50 500 1680
...
20: 20 200 2000 55050240
Any ideas on how this could be done?
You can do this with a little arithmetic:
DT$G <- mean(DT$B[1:20])
DT$G <- DT$G * cumprod(rep(2,nrow(DT)))/2
Or using data.table syntax, courtesy of @DavidArenburg:
DT[ , G := mean(B[1:20]) * cumprod(rep(2, .N)) / 2]
or from @Frank:
DT$G <- cumprod(c( mean(head(DT$B,20)), rep(2,nrow(DT)-1) ))
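Since cumprod(rep(2, n)) / 2 is just 2^(0:(n-1)), an equivalent closed form (my own restatement of the same arithmetic) is:
DT[, G := mean(B[1:20]) * 2^(seq_len(.N) - 1)]  # G[1] = 105, then doubles each row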
mycalc <- function(x, n) {
  y <- numeric(n)
  y[1] <- mean(x)                      # seed with the mean of the first 20 B values
  for (i in 2:n) y[i] <- 2 * y[i - 1]  # each row doubles the previous one
  y
}
DT[, G := mycalc(B[1:20], .N)]
I am working with a data.table that has groups of data, and for each group a position (from -1000 to +1000) and a count at each position. A small example looks like this:
dt.ex <- data.table(newID=rep(c("A","B"), each = 6), pos=rep(c(-2:3), 2), count= sample(c(1:100), 12))
newID pos count
1: A -2 29
2: A -1 32
3: A 0 33
4: A 1 45
5: A 2 51
6: A 3 26
7: B -2 22
8: B -1 79
9: B 0 2
10: B 1 48
11: B 2 87
12: B 3 38
What I want to do is to calculate the mean (or sum) over every n rows for each group of newID. That is, split each group into chunks of n rows and aggregate within each chunk. This would be the output, assuming n = 3 and summing count:
newID pos count
A -2 94
A 1 122
B -2 103
B 1 173
And I honestly have no idea on how to start without resorting to some kind of looping, which is not advisable for a 67094000 x 3 table. Aggregating per newID alone would be straightforward, but I have yet to see a solution that comes close to answering my question. plyr solutions are also welcome, although I suspect they might be too slow for this.
An alternate way (without using .SD) would be:
dt.ex[, seq := (seq_len(.N) - 1) %/% 3, by = newID][,
      list(pos = mean(pos), count = sum(count)), list(newID, seq)]
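To see what the helper column does: (seq_len(.N) - 1) %/% 3 assigns the same integer to each run of three rows within a group, for example:
(seq_len(6) - 1) %/% 3
# [1] 0 0 0 1 1 1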
Benchmarking on (relatively) bigger data:
set.seed(45)
get_grps <- function() paste(sample(letters, 5, TRUE), collapse="")
grps <- unique(replicate(1e4, get_grps()))
dt.in <- data.table(newID = sample(grps, 6e6, TRUE),
                    pos = sample(-1000:1000, 6e6, TRUE),
                    count = runif(6e6))
setkey(dt.in, newID)
require(microbenchmark)
eddi <- function(dt) {
  dt[, .SD[, list(pos = mean(pos), count = sum(count)),
           by = seq(0, .N - 1) %/% 3], by = newID]
}
arun <- function(dt) {
  dt[, seq := (seq_len(.N) - 1) %/% 3, by = newID][,
        list(pos = mean(pos), count = sum(count)), list(newID, seq)]
}
microbenchmark(o1 <- eddi(copy(dt.in)), o2 <- arun(copy(dt.in)), times=2)
Unit: seconds
expr min lq median uq max neval
o1 <- eddi(copy(dt.in)) 25.23282 25.23282 26.16009 27.08736 27.08736 2
o2 <- arun(copy(dt.in)) 13.59597 13.59597 14.41190 15.22783 15.22783 2
Try this:
dt.ex[, .SD[, list(pos = mean(pos), count = sum(count)),
            by = seq(0, .N - 1) %/% 3],
      by = newID]
Note that the parent data.table's .N is used in the nested by, because .N is only available in the j-expression.