I've got a very large dataset (millions of rows that I need to loop through thousands of times), and during the loop I have to do a conditional sum that appears to be taking a very long time. Is there a way of making this more efficient?
The data.table format is as follows:
DT <- data.table('A' = c(1,1,1,2,2,3,3,3,3,4),
'B' = c(500,510,540,500,540,500,510,519,540,500),
'C' = c(10,20,10,20,10,50,20,50,20,10))
A   B   C
1   500 10
1   510 20
1   540 10
2   500 20
2   540 10
3   500 50
3   510 20
3   519 50
3   540 20
4   500 10
For each row, I need the sum of column C (in a new column, D) over the rows with the same A whose B satisfies B >= the current row's B and B < the current row's B + 20. So the output table would look like the following:
A   B   C   D
1   500 10  30
1   510 20  30
1   540 10  10
2   500 20  20
2   540 10  10
3   500 50  120
3   510 20  120
3   519 50  120
3   540 20  20
4   500 10  10
The code I'm currently using:
DT[,D:= sum(DT$C[A == DT$A & ((B >= DT$B) & (B < DT$B + 20))]), by=c('A', 'B')]
This takes a very long time to actually run, as well as giving me the wrong answer. The output I get looks like this:
A   B   C   D
1   500 10  10
1   510 20  30
1   540 10  10
2   500 20  20
2   540 10  10
3   500 50  50
3   510 20  70
3   519 50  120
3   540 20  20
4   500 10  10
(i.e. D only appears to increase cumulatively).
I'm less concerned with the cumulative thing, more about speed. Ultimately what I'm trying to get to is the largest sum of C, by A, subject to the B values being within 20 of each other. I would really appreciate any help on this! Thanks in advance.
If I understand correctly, this can be solved by a non-equi self join:
DT[, Bp20 := B + 20][
  DT, on = .(A, B >= B, B < Bp20), mult = "last"][
  , .(B, C = i.C, D = sum(i.C)), by = .(A, Bp20)][
  , Bp20 := NULL][]
A B C D
1: 1 500 10 30
2: 1 510 20 30
3: 1 540 10 10
4: 2 500 20 20
5: 2 540 10 10
6: 3 500 50 120
7: 3 510 20 120
8: 3 519 50 120
9: 3 540 20 20
10: 4 500 10 10
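An equivalent formulation aggregates during the join itself with by = .EACHI, so the matches never need to be materialised as a full table. This is a sketch of the same idea (not benchmarked here); the lo/hi columns are just a temporary lookup table holding the interval bounds:

# for each row, sum C over rows with the same A and B in [B, B + 20)
DT[, D := DT[DT[, .(A, lo = B, hi = B + 20)],
             on = .(A, B >= lo, B < hi),
             sum(C), by = .EACHI]$V1]

Since the stated end goal is the largest sum of C per A, either result can then be reduced with DT[, .(maxD = max(D)), by = A].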
# logic for B
DT[, g := B >= shift(B) & B < shift(B, 1) + 20, by = A]
# creating index column
DT[, gi := !g]
DT[is.na(gi), gi := TRUE]
DT[, gi := cumsum(gi)]
DT[, D := sum(C), by = gi] # summing by new groups
DT
# A B C g gi D
# 1: 1 500 10 NA 1 30
# 2: 1 510 20 TRUE 1 30
# 3: 1 540 10 FALSE 2 10
# 4: 2 500 20 NA 3 20
# 5: 2 540 10 FALSE 4 10
# 6: 3 500 50 NA 5 120
# 7: 3 510 20 TRUE 5 120
# 8: 3 519 50 TRUE 5 120
# 9: 3 540 20 FALSE 6 20
# 10: 4 500 10 NA 7 10
You might need to adjust the logic for B, as not all edge cases are clear from the question... if for one A value we have c(30, 40, 50, 60), should all of those rows be in one group?
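For illustration, with that chained case the shift-based logic above does put everything in one group, because each row is compared only to its immediate predecessor (DT2 is a throwaway example table, with the index steps condensed into one cumsum):

DT2 <- data.table(A = 1, B = c(30, 40, 50, 60), C = 1)
DT2[, g := B >= shift(B) & B < shift(B) + 20, by = A]
DT2[, gi := cumsum(is.na(g) | !g)]
DT2[, D := sum(C), by = gi]
DT2
#    A  B C    g gi D
# 1: 1 30 1   NA  1 4
# 2: 1 40 1 TRUE  1 4
# 3: 1 50 1 TRUE  1 4
# 4: 1 60 1 TRUE  1 4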
I want to calculate a complementary cumulative survival count for later display in a histogram (without using ggplot). E.g. count the number of elements surviving 4.0 years, 4.5 years, 5.0 years, and so on.
Input is a 10000-row data frame with 4 different types, each with its own distribution of life expectancy:
type <- c(rep("A",1000), rep("B",2000), rep("C",3000), rep("D",4000))
age <- c(rnorm(1000,6,0.5), rnorm(2000,8,0.5), rnorm(3000,10,0.5), rnorm(4000,12,0.5))
input <- data.frame(type,age,stringsAsFactors=FALSE)
Output is built using a range for the age steps:
range <- seq(floor(min(input$age)),ceiling(max(input$age)),0.5)
I expand into a data.frame with range and types:
combns <- expand.grid(age=range,type=LETTERS[1:4], stringsAsFactors=FALSE)
And then use the apply function to count the total number exceeding each age step:
CCSC.apply.all <- apply(combns[1:length(range),],1,function(x){
sum(input$age >= x["age"]) } )
and grouped by type:
CCSC.apply.type <- apply(combns,1,function(x){
sum(
input["age"] >= x["age"] &
input["type"] == x["type"]
) } )
Within the apply function, input["age"] >= x["age"] is sometimes evaluated incorrectly, which results in a wrong count. In the table below, columns 2:6 were created using apply and columns 7:11 with a for loop. Columns 2:6 are wrong; 7:11 are correct.
> output
range all-apply A-apply B-apply C-apply D-apply all-for A-for B-for C-for D-for
1 4,0 10000 1000 2000 3000 4000 10000 1000 2000 3000 4000
2 4,5 10000 1000 2000 3000 4000 9998 998 2000 3000 4000
3 5,0 10000 1000 2000 3000 4000 9978 978 2000 3000 4000
4 5,5 10000 1000 2000 3000 4000 9843 843 2000 3000 4000
5 6,0 10000 1000 2000 3000 4000 9483 483 2000 3000 4000
6 6,5 10000 1000 2000 3000 4000 9141 143 1998 3000 4000
7 7,0 10000 1000 2000 3000 4000 8981 23 1958 3000 4000
8 7,5 10000 1000 2000 3000 4000 8690 2 1688 3000 4000
9 8,0 10000 1000 2000 3000 4000 8030 0 1030 3000 4000
10 8,5 10000 1000 2000 3000 4000 7329 0 330 2999 4000
11 9,0 10000 1000 2000 3000 4000 6989 0 43 2946 4000
12 9,5 10000 1000 2000 3000 4000 6528 0 2 2526 4000
13 10,0 10000 1000 2000 3000 4000 5494 0 0 1494 4000
14 10,5 8961 1000 2000 1967 3994 4455 0 0 461 3994
15 11,0 8485 1000 2000 1571 3914 3979 0 0 65 3914
16 11,5 7900 1000 2000 1510 3390 3394 0 0 4 3390
17 12,0 6515 1000 2000 1506 2009 2009 0 0 0 2009
18 12,5 5123 1000 2000 1506 617 617 0 0 0 617
19 13,0 4594 1000 2000 1506 88 88 0 0 0 88
20 13,5 4513 1000 2000 1506 7 7 0 0 0 7
21 14,0 4506 1000 2000 1506 0 0 0 0 0 0
Can someone tell me what is the problem with my apply function?
An additional observation: contrary to my expectation, the apply method is about a factor of 100 slower than the for loop.
Please find the full R script below:
rm(list=ls())
setwd("C:/R_test")
options(OutDec= ",") # to be deleted if not applicable for locale
set.seed(1234)
# creating input - data.frame 10000 data sets to be examined
# 4 different types with 4 different normal distributions for life expectancy
type <- c(rep("A",1000), rep("B",2000), rep("C",3000), rep("D",4000))
age <- c(rnorm(1000,6,0.5), rnorm(2000,8,0.5), rnorm(3000,10,0.5), rnorm(4000,12,0.5))
input <- data.frame(type,age,stringsAsFactors=FALSE)
# complementary cumulative survival count (CCSC)
range <- seq(floor(min(input$age)),ceiling(max(input$age)),0.5) # range for evaluation
# method "apply"
apply_time <- system.time( {
combns <- expand.grid(age=range,type=LETTERS[1:4], stringsAsFactors=FALSE)
CCSC.apply.all <- apply(combns[1:length(range),],1,function(x){
sum(input$age >= x["age"]) } ) # count survivors of all types
CCSC.apply.type <- apply(combns,1,function(x){
sum( # count survivors of certain type
input["age"] >= x["age"] &
input["type"] == x["type"]
) } )
})
output <- data.frame(range, matrix(c(CCSC.apply.all,CCSC.apply.type), nrow=length(range)))
# method "for loop"
for_time <- system.time( {
CCSC.for.all <- rep(0.0, length(range))
CCSC.for.type <- matrix(rep(0.0, 4*length(range)),nrow=length(range))
for(i in 1:length(range))
{
CCSC.for.all[i] <- sum(input$age >= range[i])
for(j in 1:4)
{
CCSC.for.type[i,j] <-
sum(
input["age"] >= range[i] &
input["type"] == LETTERS[j]
)
}
}
})
output <- cbind(output,CCSC.for.all,CCSC.for.type)
colnames(output) <- c("range",
"all-apply","A-apply","B-apply","C-apply","D-apply",
"all-for","A-for","B-for","C-for","D-for")
cat("\ntime for apply method: ", apply_time)
cat("\ntime for for loop method: ", for_time, "\n\n")
write.table(input, file = "CCSC_input.csv", sep=";", row.names=FALSE, dec=",")
write.table(output, file = "CCSC_output.csv", sep=";", row.names=FALSE, dec=",")
"count the number of elements surviving 4.0 years, 4.5 years, 5.0 years and so on."
Code:
Using cut() to get age intervals.
1. By age_range and type:
library('data.table')
df <- setDT(input)[, .N, by = .(age_range = cut(age, range, include.lowest = TRUE), type)]
df[order(age_range),]
# age_range type N
# 1: (4.5,5] A 20
# 2: (5,5.5] A 123
# 3: (5.5,6] A 337
# 4: (6,6.5] A 352
# 5: (6,6.5] B 6
# 6: (6.5,7] A 151
# 7: (6.5,7] B 47
# 8: (7,7.5] A 16
# 9: (7,7.5] B 277
# 10: (7.5,8] A 1
# 11: (7.5,8] B 700
# 12: (8,8.5] B 654
# 13: (8,8.5] C 2
# 14: (8.5,9] B 273
# 15: (8.5,9] C 70
# 16: (9,9.5] B 39
# 17: (9,9.5] C 383
# 18: (9.5,10] B 4
# 19: (9.5,10] C 1023
# 20: (10,10.5] C 1065
# 21: (10,10.5] D 6
# 22: (10.5,11] C 406
# 23: (10.5,11] D 92
# 24: (11,11.5] C 49
# 25: (11,11.5] D 543
# 26: (11.5,12] C 2
# 27: (11.5,12] D 1363
# 28: (12,12.5] D 1334
# 29: (12.5,13] D 561
# 30: (13,13.5] D 92
# 31: (13.5,14] D 8
# 32: (14,14.5] D 1
2. By age_range only:
df <- setDT(input)[, .N, by = .(age_range = cut(age, range, include.lowest = TRUE))]
df[order(age_range),]
# age_range N
# 1: (4.5,5] 20
# 2: (5,5.5] 123
# 3: (5.5,6] 337
# 4: (6,6.5] 358
# 5: (6.5,7] 198
# 6: (7,7.5] 293
# 7: (7.5,8] 701
# 8: (8,8.5] 656
# 9: (8.5,9] 343
# 10: (9,9.5] 422
# 11: (9.5,10] 1027
# 12: (10,10.5] 1071
# 13: (10.5,11] 498
# 14: (11,11.5] 592
# 15: (11.5,12] 1365
# 16: (12,12.5] 1334
# 17: (12.5,13] 561
# 18: (13,13.5] 92
# 19: (13.5,14] 8
# 20: (14,14.5] 1
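If you then want the complementary cumulative count the question asks for (elements surviving past each cut point) rather than per-interval counts, a reverse cumulative sum over the ordered intervals gets you there. A sketch building on the age_range-only table above; note that with (a,b] intervals this counts elements strictly greater than each lower bound, which can differ from the OP's >= at exact boundary values:

df <- df[order(age_range)]
# survivors past an interval's lower bound = this interval's count plus all later ones
df[, surviving := rev(cumsum(rev(N)))]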
Data:
type <- c(rep("A",1000), rep("B",2000), rep("C",3000), rep("D",4000))
age <- c(rnorm(1000,6,0.5), rnorm(2000,8,0.5), rnorm(3000,10,0.5), rnorm(4000,12,0.5))
input <- data.frame(type,age,stringsAsFactors=FALSE)
range <- seq(floor(min(input$age)),ceiling(max(input$age)),0.5)
I struggled to understand exactly what you were looking for, so apologies if I've got this wrong. I've used split to make it easier to iterate over the type variable, then used the purrr package to iterate rather than the apply family.
By being more explicit outside the iterating function - e.g. using unique(combns$age) - I think it's easier to understand what is being iterated over. For example, in your original code, I think x["age"] resulted in a character rather than a numeric as you were expecting.
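That coercion is easy to demonstrate: apply() converts each row of a mixed data.frame to a character matrix (numeric columns go through format(), which can even pad values with spaces), so the comparison silently becomes a lexicographic string comparison. A minimal example, separate from your data:

apply(data.frame(age = 4.5, type = "A"), 1, function(x) class(x["age"]))
# [1] "character"
"5.4" >= "10.5"
# [1] TRUE   (lexicographic: "5" sorts after "1")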
FYI - differences in values are likely due to the use of rnorm in generating the data and not setting a seed.
# split input list by type
input_list <- split(input, type)
# for each type, calculate age >= each unique value of combns$age
purrr::map_df(input_list,
.f = function(y) {
purrr::map_dbl(unique(combns$age),
.f = function(x) sum(y$age >= x))
})
# A tibble: 21 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1000 2000 3000 4000
2 1000 2000 3000 4000
3 970 2000 3000 4000
4 841 2000 3000 4000
5 458 2000 3000 4000
6 149 2000 3000 4000
7 32 1956 3000 4000
8 2 1704 3000 4000
9 0 1022 3000 4000
10 0 340 2997 4000
# … with 11 more rows
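As a further aside on speed: because >= is vectorised, the whole grid can be computed with no explicit row iteration at all. A sketch using outer(), assuming input and range as defined in the question:

# logical matrix with one row per element and one column per age step
CCSC.all <- colSums(outer(input$age, range, ">="))
# the same per type, giving a length(range) x 4 matrix
CCSC.type <- sapply(split(input$age, input$type),
                    function(a) colSums(outer(a, range, ">=")))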
I have two tables as follows:
library(data.table)
Input <- data.table("Date" = 1:10, "Cycle" = c(90,100,130,180,200,230,250,260,300,320))
Date Cycle
1: 1 90
2: 2 100
3: 3 130
4: 4 180
5: 5 200
6: 6 230
7: 7 250
8: 8 260
9: 9 300
10: 10 320
FDate<-data.table("Date"=seq(1:9),"Cycle"=c(90,100,130,180,200,230,250,260,300),"Task"=c("D","A","B,C",NA,"A,D","D","C","D","A,C,D"))
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B,C
4: 4 180 <NA>
5: 5 200 A,D
6: 6 230 D
7: 7 250 C
8: 8 260 D
9: 9 300 A,C,D
I just want an output table with the non-overlapping Date and its corresponding Cycle.
I tried setdiff but it doesn't work. I expect my output to look like this:
Date Cycle
10 320
When I tried setdiff(FDate$Date, Input$Date), it returned integer(0).
We can use fsetdiff from data.table, keeping only the columns common to both datasets:
fsetdiff(Input, FDate[ , names(Input), with = FALSE])
# Date Cycle
#1: 10 320
Or an anti-join, as @Frank mentioned:
Input[!FDate, on=.(Date)]
# Date Cycle
#1: 10 320
In the OP's code,
setdiff(FDate$Date, Input$Date)
the first argument is the 'Date' column from 'FDate'. All of the elements in that column are also present in the master data 'Input$Date', so it returns integer(0). If we reverse the arguments, it returns 10.
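For completeness, the reversed call and a subset based on it (using the data from the question):

setdiff(Input$Date, FDate$Date)
# [1] 10
Input[Date %in% setdiff(Date, FDate$Date)]
#    Date Cycle
# 1:   10   320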
I would like to print a table that has the row names from one column and the column names from a second column, and the table values from a third column.
Here's the contents of the data table dt. heading is the ship heading, shipSpdKt is the ship speed in kts, and V1 is the table value.
heading shipSpdKt V1
1: 0 5 -4.3057799
2: 30 5 -4.2452984
3: 60 5 -5.6391077
4: 90 5 -4.9353771
5: 120 5 -4.3519821
6: 150 5 -2.3346472
7: 180 5 -1.6274207
8: 0 10 -6.0007901
9: 30 10 -6.7137480
10: 60 10 -6.9774241
11: 90 10 -5.7268767
12: 120 10 -3.9167585
13: 150 10 -1.8365736
14: 180 10 -1.1103727
15: 0 20 -6.4556379
16: 30 20 -7.3609538
17: 60 20 -11.3018260
18: 90 20 -7.7640429
19: 120 20 -4.7670283
20: 150 20 -1.7899857
21: 180 20 -0.9479594
22: 0 30 -4.2182927
23: 30 30 -5.8362999
24: 60 30 -8.9905834
25: 90 30 -7.2139764
26: 120 30 -5.1285415
27: 150 30 -2.2860508
28: 180 30 -0.8197407
I want something like
print(dt)
to produce a printed table showing the value of V1 at the heading and ship speed values. Here is a partial table showing the desired output.
shipSpdKt
heading 5 10 20 30
0 -4.305 -6.000 -6.455 -4.218
30 -4.245
60 -5.639
90 -4.935
120 -4.351
150 -2.334
180 -1.627
This is my initial attempt, but it doesn't produce the desired table:
reshape(dt,v.names='heading',idvar='shipSpdKt',timevar="heading",direction="wide")
The output is
shipSpdKt V1 heading.0 heading.30 heading.60 heading.90 heading.120 heading.150
1: 5 4.505957 0 30 60 90 120 150
2: 10 5.683579 0 30 60 90 120 150
3: 20 6.427269 0 30 60 90 120 150
4: 30 3.961622 0 30 60 90 120 150
heading.180
1: 180
2: 180
3: 180
4: 180
Using the base reshape function:
reshape(dt, timevar = "shipSpdKt", idvar = "heading", direction = "wide")
Using the reshape2 package:
reshape2::dcast(dt, heading ~ shipSpdKt, value.var = "V1")
Using the tidyr package:
tidyr::spread(dt, shipSpdKt, V1)
Using the data.table package:
data.table::dcast.data.table(dt, heading ~ shipSpdKt, value.var = "V1")
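Any of these produces the wide layout the question asks for. For instance, the data.table call gives (values taken from the long table in the question):

data.table::dcast.data.table(dt, heading ~ shipSpdKt, value.var = "V1")
#    heading          5         10          20         30
# 1:       0 -4.3057799 -6.0007901  -6.4556379 -4.2182927
# 2:      30 -4.2452984 -6.7137480  -7.3609538 -5.8362999
# 3:      60 -5.6391077 -6.9774241 -11.3018260 -8.9905834
# 4:      90 -4.9353771 -5.7268767  -7.7640429 -7.2139764
# 5:     120 -4.3519821 -3.9167585  -4.7670283 -5.1285415
# 6:     150 -2.3346472 -1.8365736  -1.7899857 -2.2860508
# 7:     180 -1.6274207 -1.1103727  -0.9479594 -0.8197407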
This question is very similar to Sample random rows within each group in a data.table.
The difference is in a minor subtlety that I did not have enough reputation to discuss for that question itself.
Let's change Christopher Manning's initial data a little bit:
> DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 2 632
6: 3 186
7: 4 761
8: 5 150
9: 6 423
10: 7 832
11: 8 883
12: 9 247
13: 10 894
14: 11 141
15: 12 891
16: 13 488
17: 14 101
18: 15 677
19: 1 400
20: 1 467
If we tried the question's solution:
> DT[,.SD[sample(.N,3)],by = a]
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
This is because there are values in column a that occur only once. We cannot draw 3 samples from values that occur fewer than three times without using replacement (which we do not want to do).
I am struggling to deal with this scenario. We want to sample 3 times when the number of occurrences is >= 3, but pull all of the occurrences when there are fewer than 3. For example, with our DT above we would want:
a b
1: 1 102
2: 1 5
3: 1 658
4: 2 632
5: 3 186
6: 4 761
7: 5 150
8: 6 423
9: 7 832
10: 8 883
11: 9 247
12: 10 894
13: 11 141
14: 12 891
15: 13 488
16: 14 101
17: 15 677
Maybe a solution could involve sorting the data.table like this, then using rle() lengths to find out which n to use in the sample function above:
> DT <- DT[order(DT$a),]
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 1 400
6: 1 467
7: 2 632
8: 3 186
9: 4 761
10: 5 150
11: 6 423
12: 7 832
13: 8 883
14: 9 247
15: 10 894
16: 11 141
17: 12 891
18: 13 488
19: 14 101
20: 15 677
> ifelse(rle(DT$a)$lengths >= 3, 3, rle(DT$a)$lengths)
 [1] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If we replace "3" with n, this will return how many we should sample from a=1, a=2, a=3...
I have yet to find a way to incorporate this into a final solution. Any help would be appreciated!
I might be misunderstanding your question, but are you looking for something like this?
set.seed(123)
##
DT <- data.table(
a=c(1,1,1,1:15,1,1),
b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42
where we draw 3 samples from b for group a_i if a_i contains three or more values, and otherwise draw all n of its values, where n (n < 3) is the size of group a_i.
Just for demonstration, here are the 6 possible values of b for a=1 that we are sampling from (assuming you use the same random seed as above):
R> DT[order(a)][1:6,]
a b
1: 1 288
2: 1 788
3: 1 409
4: 1 881
5: 1 323
6: 1 996
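An equivalent trick, if you'd rather avoid the min() call, is to permute each group's row indices and take at most the first three, since head() silently returns fewer elements when the group is smaller:

DT[, .SD[head(sample(.N), 3)], by = a]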