I've got a very large dataset (millions of rows that I need to loop through thousands of times), and during the loop I have to do a conditional sum that appears to be taking a very long time. Is there a way of making this more efficient?
The data.table is structured as follows:
DT <- data.table('A' = c(1,1,1,2,2,3,3,3,3,4),
'B' = c(500,510,540,500,540,500,510,519,540,500),
'C' = c(10,20,10,20,10,50,20,50,20,10))
A   B    C
1   500  10
1   510  20
1   540  10
2   500  20
2   540  10
3   500  50
3   510  20
3   519  50
3   540  20
4   500  10
I need the sum of column C (in a new column, D) subject to A == A, and B >= B & B < B + 20 (by row). So the output table would look like the following:
A   B    C    D
1   500  10   30
1   510  20   30
1   540  10   10
2   500  20   20
2   540  10   10
3   500  50   120
3   510  20   120
3   519  50   120
3   540  20   20
4   500  10   10
The code I'm currently using:
DT[,D:= sum(DT$C[A == DT$A & ((B >= DT$B) & (B < DT$B + 20))]), by=c('A', 'B')]
This takes a very long time to actually run, as well as giving me the wrong answer. The output I get looks like this:
A   B    C    D
1   500  10   10
1   510  20   30
1   540  10   10
2   500  20   20
2   540  10   10
3   500  50   50
3   510  20   70
3   519  50   120
3   540  20   20
4   500  10   10
(i.e. D only appears to increase cumulatively).
I'm less concerned with the cumulative thing, more about speed. Ultimately what I'm trying to get to is the largest sum of C, by A, subject to B being within 20 of each other. I would really appreciate any help on this! Thanks in advance.
If I understand correctly, this can be solved by a non-equi self join:
DT[, Bp20 := B + 20][
DT, on = .(A, B >= B, B < Bp20), mult = "last"][
, .(B, C = i.C, D = sum(i.C)), by = .(A, Bp20)][
, Bp20 := NULL][]
A B C D
1: 1 500 10 30
2: 1 510 20 30
3: 1 540 10 10
4: 2 500 20 20
5: 2 540 10 10
6: 3 500 50 120
7: 3 510 20 120
8: 3 519 50 120
9: 3 540 20 20
10: 4 500 10 10
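The question's stated end goal is the largest sum of C per A over B values within 20 of each other. One way to read that (an assumption on my part, not something the question confirms) is a forward window [B, B + 20) evaluated at every row. A minimal sketch using a non-equi join with by = .EACHI; the names win, lo, hi, D_forward and best are made up for illustration:

```r
library(data.table)

DT <- data.table(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
                 B = c(500, 510, 540, 500, 540, 500, 510, 519, 540, 500),
                 C = c(10, 20, 10, 20, 10, 50, 20, 50, 20, 10))

# One window per row: same A, B in [lo, hi) (assumed reading of the condition)
win <- DT[, .(A, lo = B, hi = B + 20)]

# by = .EACHI evaluates sum(C) once for every row of win, in win's row order
res <- DT[win, .(D = sum(C)), on = .(A, B >= lo, B < hi), by = .EACHI]
DT[, D_forward := res$D]

# Largest window sum per A
best <- DT[, .(best = max(D_forward)), by = A]
```

Note that the per-row values differ from the grouped table in the question (for A = 3, B = 510 the forward window gives 70 rather than 120), but the per-A maximum is the same on this sample.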
# logic for B
DT[, g := B >= shift(B) & B < shift(B, 1) + 20, by = A]
# creating index column
DT[, gi := !g]
DT[is.na(gi), gi := T]
DT[, gi := cumsum(gi)]
DT[, D := sum(C), by = gi] # summing by new groups
DT
# A B C g gi D
# 1: 1 500 10 NA 1 30
# 2: 1 510 20 TRUE 1 30
# 3: 1 540 10 FALSE 2 10
# 4: 2 500 20 NA 3 20
# 5: 2 540 10 FALSE 4 10
# 6: 3 500 50 NA 5 120
# 7: 3 510 20 TRUE 5 120
# 8: 3 519 50 TRUE 5 120
# 9: 3 540 20 FALSE 6 20
# 10: 4 500 10 NA 7 10
You might need to adjust the logic for B, as not all edge cases are clear from the question. For example, if one A value has B values c(30, 40, 50, 60), are all of those rows in one group?
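To make that edge case concrete: a rule that only compares each row with the previous one chains c(30, 40, 50, 60) into a single group, even though 30 and 60 are 30 apart. A small sketch with a simplified gap test, assuming B is sorted within A (new_grp is an illustrative name and C = 1 is made-up data):

```r
library(data.table)

x <- data.table(A = 1, B = c(30, 40, 50, 60), C = 1)

# A row starts a new group if it is the first of its A
# or if it is 20 or more above the previous B
x[, new_grp := is.na(shift(B)) | (B - shift(B)) >= 20, by = A]
x[, gi := cumsum(new_grp)]

# Every consecutive gap is 10, so all four rows share one group
x[, D := sum(C), by = gi]
```

If the intended rule is "within 20 of the first row of the group" instead, the grouping has to be built differently, since that rule depends on where each group started and not just on the previous row.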
I would like to turn the first table into the second by selecting the last observation of a group for a and b, the first observation for c, sum each observation for the group for d and e, and for f, check if a valid date exists and use that date.
Table 1:
ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976
Table 2:
ID a b c d e f
1 11 111 1000 30030 300300 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 60010 600100 20/12/1978
4 44 444 4000 240150 2401500 7/06/1944
5 55 555 5000 100010 1000100 31/05/1976
I have looked up StackOverflow questions and I have only seen elements of this. I can do a through to e in the following steps.
library(data.table)
setwd('D:/Work/BRB/StackOverflow')
DT = data.table(fread('datatable.csv', header=TRUE))
AB = DT[ , .SD[.N], ID ]
AB = AB[ , c('a', 'b') ]
C = DT[ , .SD[1], ID ]
C = C[ , 'c' ]
DE = DT[ , .(d = sum(d), e = sum(e)) , by = ID ]
Final = cbind(AB, C, DE)
Final
My question is, can I do the operations on variables a, b, c, d, e in one transformation without having to split it into 3?
Also, I have no idea how to do f. Any suggestions?
Finally, I am new to R. Anything else I can improve about my code?
There are several things you can improve:
fread will return a data.table, so no need to wrap it in data.table. You can check with class(DT).
Use the na.strings parameter when reading in the data. See below for an example.
Summarise with:
DT[, .(a = a[.N],
b = b[.N],
c = c[1],
d = sum(d),
e = sum(e),
f = unique(na.omit(f)))
, by = ID]
you will then get:
ID a b c d e f
1: 1 11 111 1000 30030 300300 5/07/1977
2: 2 22 222 2000 20000 200000 6/02/1980
3: 3 33 333 3000 60010 600100 20/12/1978
4: 4 44 444 4000 240150 2401500 7/06/1944
5: 5 55 555 5000 100010 1000100 31/05/1976
Some explanations & other notes:
Subsetting with [1] will give you the first value of a group. You could also use the first-function which is optimized in data.table, and thus faster.
Subsetting with [.N] will give you the last value of a group. You could also use the last-function which is optimized in data.table, and thus faster.
Don't use variable names that are also functions in R (in this case, don't use c as a variable name). See also ?c for an explanation of what the c-function does.
For summarising the f-variable, I used unique in combination with na.omit. If there is more than one unique date by ID, you could also use for example na.omit(f)[1].
If speed is an issue, you could optimize the above to (thanks to @Frank):
DT[order(f)
, .(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = first(f))
, by = ID]
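Ordering by f will put NA values last, so first(f) returns a non-missing date whenever one exists. As a result, the internal GForce optimization is now used for all calculations. Be aware that reordering by f can also change which rows first() and last() pick for a, b and c, so compare the result with the unoptimized version above.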
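One thing worth checking with the ordered variant is whether first() and last() still pick the intended rows once the data has been reordered. A small sketch on the ID 1 rows only, with a reduced set of columns (plain and ordered are illustrative names):

```r
library(data.table)

DT <- fread("ID a c f
1 10 1000 ?
1 10 1001 5/07/1977
1 11 1002 5/07/1977", na.strings = "?")

# Original row order: last(a) and first(c) refer to the data as read
plain <- DT[, .(a = last(a), c = first(c), f = na.omit(f)[1]), by = ID]

# Ordered by f: the row with the missing date moves to the end of the group,
# so last(a) and first(c) now refer to different rows
ordered <- DT[order(f), .(a = last(a), c = first(c), f = first(f)), by = ID]
```

Here both versions agree on f, but plain gives a = 11 and c = 1000 while ordered gives a = 10 and c = 1001. If that matters for your data, keep first() and last() on the original order and handle f separately.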
Used data:
DT <- fread("ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976", na.strings='?')
We can use tidyverse. After grouping by 'ID', we summarise the columns based on the first or last observation
library(dplyr)
DT %>%
group_by(ID) %>%
summarise(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = f[f!="?"][1])
# A tibble: 5 × 7
# ID a b c d e f
# <int> <int> <int> <int> <int> <int> <chr>
#1 1 11 111 1000 30030 300300 5/07/1977
#2 2 22 222 2000 20000 200000 6/02/1980
#3 3 33 333 3000 60010 600100 20/12/1978
#4 4 44 444 4000 240150 2401500 7/06/1944
#5 5 55 555 5000 100010 1000100 31/05/1976
Here is my data :
class x1 x2
c 6 90
b 5 50
c 3 70
b 9 40
a 5 30
b 1 60
a 7 20
c 4 80
a 2 10
I first want to order it by class (increasing or decreasing doesn't really matter) and then by x1 (decreasing), so I do the following :
df <- df[with(df, order(class, x1, decreasing = TRUE))]
class x1 x2
c 6 90
c 4 80
c 3 70
b 9 40
b 5 50
b 1 60
a 7 20
a 5 30
a 2 10
And then I would like the cumulative sum of x2 for each class :
class x1 x2 cumsum
c 6 90 90
c 4 80 170 # 90+80
c 3 70 240 # 90+80+70
b 9 40 40
b 5 50 90 # 40+50
b 1 60 150 # 40+50+60
a 7 20 20
a 5 30 50 # 20+30
a 2 10 60 # 20+30+10
Following this answer, I did this :
df$cumsum <- unlist(by(df$x2, df$class, cumsum))
# (Also tried this, same result)
df$cumsum <- unlist(by(df[,x2], df[,class], cumsum))
But what I get is a cumulative sum over the whole set, and misordered. To be more specific, here is what I get :
class x1 x2 cumsum
c 6 90 20 # this cumsum
c 4 80 50 # and this cumsum
c 3 70 60 # and this cumsum are the cumsum of the lines of class a,
b 9 40 100 # then it adds the 'x2' values of class b : 60 ('cumsum' from the previous line) + 40
b 5 50 150 # and keeps doing so : 100 + 50
b 1 60 210 # 150 + 60
a 7 20 300 # 210 + 90
a 5 30 380 # 300 + 80
a 2 10 450 # 380 + 70
Any idea on how I could solve this ? Thanks
dplyr can work here too
library(dplyr)
df %>%
group_by(class) %>%
arrange(desc(x1)) %>%
mutate(cumsum=cumsum(x2))
## class x1 x2 cumsum
## (fctr) (int) (int) (int)
## 1 a 7 20 20
## 2 a 5 30 50
## 3 a 2 10 60
## 4 b 9 40 40
## 5 b 5 50 90
## 6 b 1 60 150
## 7 c 6 90 90
## 8 c 4 80 170
## 9 c 3 70 240
As described here (https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) and elsewhere, the group_by in conjunction with arrange implies that the data will be sorted by the grouping variable first.
We can use data.table
library(data.table)
setDT(df)[, x2:= cumsum(x2) , class]
df
# class x1 x2
#1: c 6 90
#2: c 4 170
#3: c 3 240
#4: b 9 40
#5: b 5 90
#6: b 1 150
#7: a 7 20
#8: a 5 50
#9: a 2 60
NOTE: In the above I used the ordered data
If we need to order also,
setorder(setDT(df), -class, -x1)[, x2:=cumsum(x2), class]
You can use base R transform and ave to cumsum over the class column
transform(df[order(df$class, decreasing = T), ], cumsum = ave(x2, class, FUN=cumsum))
# class x1 x2 cumsum
#1 c 6 90 90
#3 c 3 70 160
#8 c 4 80 240
#2 b 5 50 50
#4 b 9 40 90
#6 b 1 60 150
#5 a 5 30 30
#7 a 7 20 50
#9 a 2 10 60
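The base R version can also sort by both keys before accumulating, which reproduces the table asked for in the question. A minimal sketch, assuming a plain data.frame built from the question's data:

```r
df <- data.frame(class = c("c", "b", "c", "b", "a", "b", "a", "c", "a"),
                 x1 = c(6, 5, 3, 9, 5, 1, 7, 4, 2),
                 x2 = c(90, 50, 70, 40, 30, 60, 20, 80, 10))

# Sort by class, then by x1, both decreasing
df <- df[order(df$class, df$x1, decreasing = TRUE), ]

# ave() applies cumsum within each class and keeps the current row order
df$cumsum <- ave(df$x2, df$class, FUN = cumsum)
```

Because ave() returns its result in the same row order as its input, the new column lines up with the sorted rows without any unlist() step.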
I have a data.table that looks like this:
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to calculate a new column "D" whose first value is the average of the first 20 rows of column B, and then use each value of D to calculate the next one.
Say the average of the first 20 rows of column B is 105. The formula for the second row of column D is then DT$D[1]+DT$C[2],
that is, the previous value of D plus the current row's value of C.
The third row is then DT$D[2]+DT$C[3]:
A B C D
1: 1 10 100 105
2: 2 20 200 305
3: 3 30 300 605
4: 4 40 400 1005
5: 5 50 500 1505
...
20: 20 200 2000 21005
Any ideas on how this could be done?
I think shift would be a great help for lagging, but I don't know how to get rid of the NA that it produces in the first row.
We can take the mean of the first 20 rows of column B and add the cumulative sum of C. The one special consideration is that the first row must not add its own C value, so we take the cumulative sum of 0 concatenated with column C minus its first element.
DT[, D := mean(B[1:20]) + cumsum(c(0, C[-1]))][]
# A B C D
# 1: 1 10 100 105
# 2: 2 20 200 305
# 3: 3 30 300 605
# 4: 4 40 400 1005
# 5: 5 50 500 1505
# 6: 6 60 600 2105
# 7: 7 70 700 2805
# 8: 8 80 800 3605
# 9: 9 90 900 4505
# 10: 10 100 1000 5505
# 11: 11 110 1100 6605
# 12: 12 120 1200 7805
# 13: 13 130 1300 9105
# 14: 14 140 1400 10505
# 15: 15 150 1500 12005
# 16: 16 160 1600 13605
# 17: 17 170 1700 15305
# 18: 18 180 1800 17105
# 19: 19 190 1900 19005
# 20: 20 200 2000 21005
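To see why the cumsum form is equivalent to the row-by-row rule in the question, it can be compared against an explicit loop. A small sketch (D_loop and D_vec are illustrative names):

```r
library(data.table)

DT <- data.table(A = 1:20, B = 1:20 * 10, C = 1:20 * 100)

# Row-by-row recurrence: D[1] = mean of B, D[i] = D[i - 1] + C[i]
D_loop <- numeric(nrow(DT))
D_loop[1] <- mean(DT$B[1:20])
for (i in 2:nrow(DT)) {
  D_loop[i] <- D_loop[i - 1] + DT$C[i]
}

# Vectorised form: start value plus the running total of C from row 2 onwards
D_vec <- mean(DT$B[1:20]) + cumsum(c(0, DT$C[-1]))

stopifnot(isTRUE(all.equal(D_loop, D_vec)))
```

The vectorised form avoids both the loop and the NA that shift would introduce in the first row.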
I have a data.table as below:
order products value
1000 A|B 10
2000 B|C 20
3000 A|C 30
4000 B|C|D 5
5000 C|D 15
And I need to break the column products and transform/normalize to be used like this:
order prod.seq prod.name value
1000 1 A 10
1000 2 B 10
2000 1 B 20
2000 2 C 20
3000 1 A 30
3000 2 C 30
4000 1 B 5
4000 2 C 5
4000 3 D 5
5000 1 C 15
5000 2 D 15
I guess I can do it using a custom FOR/LOOP but I'd like to know a more advanced way to do that using apply,ddply methods. Any suggestions?
First, convert to a character/string:
DT[,products:=as.character(products)]
Then you can split the string:
DT[,{
x = strsplit(products,"\\|")[[1]]
list( prod.seq = seq_along(x), prod_name = x )
}, by=.(order,value)]
which gives
order value prod.seq prod_name
1: 1000 10 1 A
2: 1000 10 2 B
3: 2000 20 1 B
4: 2000 20 2 C
5: 3000 30 1 A
6: 3000 30 2 C
7: 4000 5 1 B
8: 4000 5 2 C
9: 4000 5 3 D
10: 5000 15 1 C
11: 5000 15 2 D
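The same reshaping can be sketched without a grouped j, by splitting once and repeating the other columns by the number of pieces per row. This assumes products is already a character column; parts, n and long are illustrative names:

```r
library(data.table)

DT <- data.table(order = c(1000, 2000, 3000, 4000, 5000),
                 products = c("A|B", "B|C", "A|C", "B|C|D", "C|D"),
                 value = c(10, 20, 30, 5, 15))

# fixed = TRUE treats "|" literally, so no regex escaping is needed
parts <- strsplit(DT$products, "|", fixed = TRUE)
n <- lengths(parts)  # number of products per order

long <- data.table(order     = rep(DT$order, n),
                   prod.seq  = sequence(n),   # 1..n within each order
                   prod.name = unlist(parts),
                   value     = rep(DT$value, n))
```

This does not rely on order and value being unique per row, which the by = .(order, value) version implicitly does.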
Here is another option
library(splitstackshape)
out = cSplit(dat, "products", "|", direction = "long")
out[, prod.seq := seq_len(.N), by = value]
#> out
# order products value prod.seq
# 1: 1000 A 10 1
# 2: 1000 B 10 2
# 3: 2000 B 20 1
# 4: 2000 C 20 2
# 5: 3000 A 30 1
# 6: 3000 C 30 2
# 7: 4000 B 5 1
# 8: 4000 C 5 2
# 9: 4000 D 5 3
#10: 5000 C 15 1
#11: 5000 D 15 2
After cSplit step, using ddply
library(plyr)
ddply(out, .(value), mutate, prod.seq = seq_len(length(order)))
using dplyr
library(dplyr)
out %>% group_by(value) %>% mutate(prod.seq = row_number(order))
using lapply
rbindlist(lapply(split(out, out$value),
function(x){x$prod.seq = seq_len(length(x$order));x}))
I have the following code:
> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
> dt
a b c d
1: 3 1 11 21
2: 3 2 12 22
3: 3 3 13 23
4: 3 4 14 24
5: 3 5 15 25
6: 4 6 16 26
7: 4 7 17 27
8: 4 8 18 28
9: 4 9 19 29
10: 4 10 20 30
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))'
Starting dogroups ... done dogroups in 0 secs
a b c d
1: 3 15 65 115
2: 4 40 90 140
> dt[,c(count=.N,lapply(.SD,sum)),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))'
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
done dogroups in 0 secs
a count b c d
1: 3 5 15 65 115
2: 4 5 40 90 140
How do I avoid the scary "very inefficient" warning?
I can add the count column before the aggregation:
> dt$count <- 1
> dt
a b c d count
1: 3 1 11 21 1
2: 3 2 12 22 1
3: 3 3 13 23 1
4: 3 4 14 24 1
5: 3 5 15 25 1
6: 4 6 16 26 1
7: 4 7 17 27 1
8: 4 8 18 28 1
9: 4 9 19 29 1
10: 4 10 20 30 1
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))'
Starting dogroups ... done dogroups in 0 secs
a b c d count
1: 3 15 65 115 5
2: 4 40 90 140 5
but this does not look too elegant...
One way I could think of is to assign count by reference:
dt.out <- dt[, lapply(.SD,sum), by = a]
dt.out[, count := dt[, .N, by=a][, N]]
# alternatively: count := table(dt$a)
# a b c d count
# 1: 3 15 65 115 5
# 2: 4 40 90 140 5
Edit 1: I still think it's just a message and not a warning. But if you still want to avoid it, just do:
dt.out[, count := as.numeric(dt[, .N, by=a][, N])]
Edit 2: Very interesting. Doing the equivalent of multiple := assignment does not produce the same message.
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# Detected that j uses these columns: a
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# Detected that j uses these columns: <none>
# Optimization is on but j left unchanged as '.N'
# Starting dogroups ... done dogroups in 0 secs
# Detected that j uses these columns: N
# Assigning to all 2 rows
# Direct plonk of unnamed RHS, no copy.
This solution removes the message about the named elements. But you have to put the names back afterwards.
require(data.table)
options(datatable.verbose = TRUE)
dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Output
> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))'
Starting dogroups ... done dogroups in 0.001 secs
a V1 V2 V3 V4
1: 3 5 15 65 115
2: 4 5 40 90 140
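Putting the names back after the unname() approach can be done by reference with setnames. A minimal sketch, assuming the grouping column is a and the count column should be called count (out is an illustrative name):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Unnamed j result: columns come back as a, V1, V2, V3, V4
out <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

# Restore the names: grouping column, count, then the summed columns in order
setnames(out, c("a", "count", setdiff(names(dt), "a")))
```

Since setnames modifies out in place, no copy of the result is made.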