I'm calculating features for a machine learning algorithm in R using data.table. I'm only going to be making predictions on rows that have a future date, but to calculate each feature I need to aggregate over a large dataset that can have millions of rows. To improve processing speed and performance, I don't need the aggregate calculated for rows with a past date.
In plain terms, I'm trying to calculate the feature for only the last n rows (the ones with a future date) while still using the entire dataset for the aggregation, and skip the rows where the Date is in the past. I have a user-defined function that counts the number of rows above the current row being calculated in the loop. My example data.table below shows the outcome I'm trying to achieve. The row being calculated will count the number of rows above it and then move to the next row in the loop. I want it to skip all rows with a past date and only calculate rows with a future date. The current date in this example is 2019-03-20.
Group Date Appt Sum
A 2019-03-18 1 NA
A 2019-03-19 1 NA
A 2019-03-20 1 NA
A 2019-03-21 1 3
A 2019-03-22 1 4
A 2019-03-23 1 5
library(data.table)
dt = structure(list(Group = c("A", "A", "A", "A", "A", "A"), Date = structure(c(17973,
17974, 17975, 17976, 17977, 17978), class = "Date"), Appt = c(1L,
1L, 1L, 1L, 1L, 1L), Sum = c(NA, NA, NA, 3, 4, 5)), row.names = c(NA,
-6L), class = "data.frame")
setDT(dt)
This is the function and code I'm currently using and it works perfectly. The only problem is that it performs the calculation on every row even though I only need the calculation results for a few rows at the end of the dataset that can be in the millions. It's wasting a ton of processing power and time by making calculations that will be excluded from the prediction set.
library(zoo)

rollingSum <- function(i, data, count, dates) {
  # Build a zoo series of counts indexed by date, fill in the missing days,
  # then take a 365-day rolling sum and return it at the original dates.
  z <- with(data[i, ], zoo(count, dates))
  g <- zoo(, seq(start(z), end(z), by = "day"))
  m <- merge(z, g)
  window(rollapplyr(m, 365, sum, na.rm = TRUE, partial = TRUE), time(z))
}
dt[, Sum := as.numeric(rollingSum(data=dt, count=Appt, dates=Date) - Appt), by=Group]
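One way to cut the work down, sketched here under my own assumptions (the cutoff date, the reuse of the 365-day window hard-coded in rollingSum, and the join back into dt are illustrative, not necessarily the final code): only rows within 365 days of the cutoff can influence the sums of the future rows, so the calculation can run on that slice and the results can be joined back.

# Sketch: restrict the history to what the 365-day window can still reach,
# compute the rolling sum on that slice, then update only the future rows.
cutoff <- as.Date("2019-03-20")                      # the "current date" in the example
recent <- dt[Date >= cutoff - 365L]
recent[, Sum := as.numeric(rollingSum(data = .SD, count = Appt, dates = Date) - Appt),
       by = Group]
dt[recent[Date > cutoff], on = .(Group, Date), Sum := i.Sum]

The update join leaves Sum as NA for the past rows, which matches the example table above.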
Here are some more details for my comment above:
dt <- data.table(
Group = "A",
Date = as.IDate(c("2019-03-18", "2019-03-19", "2019-03-20",
"2019-03-21", "2019-03-22","2019-03-23")),
Appt = 1)
library(microbenchmark)

microbenchmark(
  dt[, Sum := as.numeric(rollingSum(data=dt, count=Appt, dates=Date) - Appt), by=Group],
  dt[, Sum2 := ifelse(Date > as.IDate("2019-03-20"), (1:.N) - Appt, as.numeric(NA)), by=Group],
  unit = "ms")
here are the benchmarks:
Unit: milliseconds
expr min lq mean median uq max neval
rollingSum 3.463955 4.0644910 18.748804 4.353562 4.745325 1395.840823 100
new func 0.768079 0.8757095 1.258782 1.015766 1.140316 8.275985 100
Related
I've tried the various answers so far here:
Combining IRanges objects and maintaining mcols
Find all date ranges for overlapping start and end dates in R
Find groups of overlapping intervals with data.table
Finding all overlaps in one iteration of foverlap in R's data.table
Find dates within a period interval by group
R Find overlap among time periods
Detect overlapping dates by group with R
Some of these work, but they are not very performant for very large datasets (8M-12M rows).
Here is some sample code of what I've been trying:
library(tidyverse)
library(data.table)

size = 10000

df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  period = sample(c(5, 10, 30, 45), size, replace = TRUE),
  start = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"),
                 size, replace = TRUE)
) %>% mutate(end = start + period)

dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]

overlapping <- unique(foverlaps(dt, dt)[ID == i.ID & row != i.row, ID])

dt[, `:=`(Overlap = FALSE)][ID %in% overlapping, Overlap := TRUE][order(ID, start)] %>%
  distinct(ID, Overlap) %>%
  count(Overlap) %>%
  mutate(freq = n / sum(n))
This one works fine, but as the dataset gets bigger it's either slow or it fails with a negative length vectors error:
Error in foverlaps(dt, dt) : negative length vectors are not allowed
Is there a better way?
You could directly join by ID in foverlaps and count the number of overlaps:
size = 1e5

df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  period = sample(c(5, 10, 30, 45), size, replace = TRUE),
  start = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"),
                 size, replace = TRUE)
) %>% mutate(end = start + period)

dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]
setkey(dt, ID, start, end)

foverlaps(dt, dt, by.x = c("ID", "start", "end"), by.y = c("ID", "start", "end"))[
  , .(noverlap = .N), by = .(ID, row)][
  , .(overlap = max(noverlap > 1)), by = ID][
  , .(n = .N), by = .(overlap)][
  , pct := n / sum(n)][]
Overlap n freq
1: FALSE 547 0.2735
2: TRUE 1453 0.7265
Performance comparison:
microbenchmark::microbenchmark(old(),new())
Unit: milliseconds
expr min lq mean median uq max neval
old() 672.6338 685.8825 788.78851 694.7804 864.95855 1311.9752 100
new() 16.9942 17.7659 24.66032 18.7095 20.59965 63.3928 100
I currently have a dataset with quarterly returns for 10 indices. My dataset (compoundrates) is structured so that the first column is "Scenario", the second column is "Quarter", and the following 10 are the quarterly index returns. The projection is 50 quarters, so lines 1-51 reflect quarters 0-50 for scenario 1, lines 52-102 reflect quarters 0-50 for scenario 2, and so on for 1000 scenarios.
To calculate cumulative compound rates, I need to multiply the current return by all previous returns from the projection. I set up a loop to do this in the code below:
for (i in 1:nrow(compoundrates)) {
  if (compoundrates[i, "Quarter"] == 0) {
    compoundrates[i, -c(1:2)] <- 1
  } else {
    compoundrates[i, -c(1:2)] <- compoundrates[i, -c(1:2)] * compoundrates[i - 1, -c(1:2)]
  }
}
The loop is simple and works how I want. However, with 51,000 rows it takes about 13 minutes. Is there a way to speed up the code? I tried to think of a vectorized solution, but could only come up with looping through all rows of the dataset. While 13 minutes is not the end of the world, I have other datasets with longer projections, up to 200 quarters, which would take extremely long.
Possibly pivoting the dataset to be horizontal would require only 50 loop iterations rather than 51,000, but I thought I'd see if anyone else had a more elegant solution.
Edit: Included here is a sample of the first couple of lines of my dataset:
> dput(head(compoundrates[, 1:4])) # First part of data, only 2 indices
structure(list(Scenario = c(1L, 1L, 1L, 1L, 1L, 1L), Quarter = c(0,
1, 2, 3, 4, 5), US = c(1, 1.06658609144463, 1.1022314574062,
1.1683883540847, 1.29134306037902, 1.28907212981088), MidCap = c(1,
1.10361590084936, 1.12966579678275, 1.21702573464001, 1.2674372889915,
1.37286942499386)), row.names = c(NA, -6L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), groups = structure(list(Scenario = 1L,
.rows = list(1:6)), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
Try this out; it uses vectorized functions to do exactly what you are trying to do with the for loop. It creates new columns so you can see what is going on. Vectorized functions usually run a lot faster than for loops.
library(tidyverse)
compoundrates %>%
  group_by(Scenario) %>%
  arrange(Quarter) %>%
  mutate(US_lag = lag(US),
         MidCap_lag = lag(MidCap),
         US_cum = US * US_lag,
         MidCap_cum = MidCap * MidCap_lag) %>%
  mutate_all(~ifelse(is.na(.), 1, .))
This should do the cumulative product you were asking for
compoundrates %>%
  group_by(Scenario) %>%
  arrange(Quarter) %>%
  mutate(US_cum = cumprod(US),
         MidCap_cum = cumprod(MidCap))
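If you need this for all 10 index columns at once, here is a data.table sketch along the same lines (my addition; it assumes every column other than Scenario and Quarter is a gross quarterly return and that the quarter-0 rows hold 1, which your loop enforces):

library(data.table)

dtc <- as.data.table(compoundrates)
setorder(dtc, Scenario, Quarter)                    # quarters in order within each scenario
ret_cols <- setdiff(names(dtc), c("Scenario", "Quarter"))
dtc[, (ret_cols) := lapply(.SD, cumprod), by = Scenario, .SDcols = ret_cols]

lapply(.SD, cumprod) applies the cumulative product to every return column within each scenario and updates the columns by reference, so no per-row loop is needed.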
Per @carl-witthoft's suggestion, here is the benchmarking. I made a data frame of 60,000 rows grouped by the 6 quarters in the OP.
Unit: milliseconds
                                                                                           expr
 big_data %>% group_by(Scenario) %>% mutate(US_cum = cumprod(US), MidCap_cum = cumprod(MidCap))
     min      lq     mean   median      uq     max neval
 63.1487 70.0257 77.16906 73.72995 79.7645 147.167   100
I have two separate datasets: df1 and df2. I would like to create a new dataset, df3 that would match the endtime column of df1 with the sent column of df2 if the datetimes are within 20 seconds of each other.
df1
endtime ID
1/7/2020 1:35:08 AM A
1/7/2020 1:39:00 AM B
1/20/2020 1:45:00 AM C
df2
sent ID
1/7/2020 1:35:20 AM E
1/7/2020 1:42:00 AM F
1/20/2020 1:55:00 AM G
1/20/2020 2:00:00 AM E
This is my desired output for df3. There is only one row, because only one pair of endtime and sent values is within 20 seconds of each other.
endtime sent
1/7/2020 1:35:08 AM 1/7/2020 1:35:20 AM
Here is the dput:
df1
structure(list(endtime = structure(c(2L, 3L, 1L), .Label = c("1/10/2020 1:45:00 AM",
"1/7/2020 1:35:08 AM", "1/7/2020 1:39:00 AM"), class = "factor"),
ID = structure(1:3, .Label = c("A", "B", "C"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
df2
structure(list(sent = structure(c(3L, 4L, 1L, 2L), .Label = c("1/20/2020 1:55:00 AM",
"1/20/2020 2:00:00 AM", "1/7/2020 1:35:20 AM", "1/7/2020 1:42:00 AM"
), class = "factor"), ID = structure(c(1L, 2L, 3L, 1L), .Label = c("E",
"F", "G"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
This is what I have tried:
I am thinking of performing a left join and matching the values, or I can use merge(), but the tricky part is matching the values with the conditional statement. Any suggestion is appreciated.
library(dplyr)
left_join(df1, df2)
Since there is no common column to join on, we can use crossing to create all combinations of rows and then filter the ones which fit the criteria.
library(dplyr)
df1 %>%
rename(ID1 = 'ID') %>%
tidyr::crossing(df2) %>%
mutate_at(vars(endtime, sent), lubridate::mdy_hms) %>%
filter(abs(difftime(sent, endtime, units = 'secs')) < 20)
# endtime ID1 sent ID
# <dttm> <fct> <dttm> <fct>
#1 2020-01-07 01:35:08 A 2020-01-07 01:35:20 E
You can also do it like this if your datasets are too big to make the cartesian product:
library(purrr)

df1 %>%
  split(1:NROW(.)) %>%
  map(~merge(.x,
             df2[abs(difftime(df2$sent, .x$endtime, units = 's')) < 20, ],
             by = NULL)) %>%
  bind_rows()
EDIT
TLDR
Use the non-equi join from data.table; it has the best performance overall.
library(data.table)

dt1 = as.data.table(df1)
dt2 = as.data.table(df2)

dt1[, `:=`(endtime_min = endtime - 20, endtime_max = endtime + 20)]
dt1[dt2,
    .(ID, ID1, endtime, sent),
    on = .(endtime_min < sent, endtime_max > sent), nomatch = 0L, allow.cartesian = TRUE]
Longer version
The answer I posted would be better in scenarios where the data frames are too big, since doing the cross join first yields a data frame with as many rows as the product of the number of rows of both data frames. By filtering first and joining after, it avoids that unnecessary memory allocation. However, it has the overhead of checking, for each row of df1, whether there are any matching rows in df2.
Another use case where this answer would be better is when one of the data frames is much smaller than the other, even if they aren't that big. I ran some benchmarks to check this.
However, after coming across this answer and writing the data.table version of the solution for the OP's question, none of the other answers comes close to the performance of that implementation.
The tests I ran used the datasets given by the OP; to simulate larger datasets I simply replicated them a certain number of times. I ran two tests:
Replicated both datasets the same number of times
Fixed the size of df1 and replicated df2
For each test, I measured the median execution time for the accepted answer (merge_filter), my original answer (filter_merge) and the data.table solution (datatable).
Prior to running the tests, I prepared both df1 and df2 to have the right datatypes, and renamed the column ID from df1 to ID1. For the data.table solution, I converted both data frames into their data.tables counterparts, dt1 and dt2.
For each method I had to make some changes, mainly using merge(..., by=NULL) instead of crossing(...), since the latter doesn't support cross joins with duplicate rows (it removes all duplicate rows from the resulting dataset).
Here is the code I used to run the tests:
library(tidyverse)
library(data.table)
run_test = function(n, n1=n, n2=n) {
  # df1_op / df2_op are the prepared copies of the OP's df1 and df2
  df1 = bind_rows(rep(list(df1_op), n1))
  df2 = bind_rows(rep(list(df2_op), n2))

  dt1 = as.data.table(df1)
  dt2 = as.data.table(df2)

  microbenchmark::microbenchmark(
    merge_filter = df1 %>%
      merge(df2, by = NULL) %>%
      filter(abs(difftime(sent, endtime, units = 'secs')) < 20),
    filter_merge = df1 %>%
      split(1:NROW(.)) %>%
      map(~merge(.x,
                 df2[abs(difftime(df2$sent, .x$endtime, units = 's')) < 20, ],
                 by = NULL)) %>%
      bind_rows(),
    datatable = {
      dt1[, `:=`(endtime_min = endtime - 20, endtime_max = endtime + 20)]
      dt1[dt2,
          .(ID, ID1, endtime, sent),
          on = .(endtime_min < sent, endtime_max > sent),
          nomatch = 0L, allow.cartesian = TRUE]
    }
  )
}
test_1_list = list()
for( n in c(1, 2, 5, 10, 20, 50, 100, 200, 500) ) {
test_1_list[[ toString(n) ]] <- run_test(n)
}
test_2_list = list()
for( n in c(1, 2, 5, 10, 20, 50, 100, 200, 500,
1000, 2000, 5000, 10000, 20000, 50000) ) {
test_2_list[[ toString(n) ]] <- run_test(n, n1=1)
}
And here are the results for test 1 and 2 respectively:
EDIT 2
You can do a non-equi left join like this:
filter_merge
df1 %>%
split(1:NROW(.)) %>%
map( ~merge(mutate(.x, k=1),
df2 %>%
filter( abs(difftime(df2$sent, .x$endtime, units='s')) < 20 ) %>%
mutate(k=1),
by="k",
all.x=T) %>%
select(-k) ) %>%
bind_rows() %>%
select(ID1, endtime, ID, sent)
# ID1 endtime ID sent
# 1 A 2020-01-07 01:35:08 E 2020-01-07 01:35:20
# 2 B 2020-01-07 01:39:00 <NA> <NA>
# 3 C 2020-01-10 01:45:00 <NA> <NA>
datatable
dt1[, `:=`(endtime_min = endtime - 20, endtime_max = endtime + 20)]
dt2[dt1,
    .(i.ID1, i.endtime, x.ID, x.sent),
    on = .(sent > endtime_min, sent < endtime_max), allow.cartesian = TRUE]
# i.ID1 i.endtime x.ID x.sent
# 1: A 2020-01-07 01:35:08 E 2020-01-07 01:35:20
# 2: B 2020-01-07 01:39:00 <NA> <NA>
# 3: C 2020-01-10 01:45:00 <NA> <NA>
I hope this is an acceptable R/data.table problem.
I have a 3-column table with:
id: geographic location IDs (303,453 locations)
month: monthly dates over the 25 years 1990-2014
spei: a climatic index that varies between -7 and 7
I need to count the occurrence of droughts at each location over the entire 1990-2014 period. A drought event is defined as "a period in which the SPEI is continuously negative and the SPEI reaches a value of -1.0 or less. Drought starts when the SPEI first falls below zero and ends with the first positive SPEI value following a value of -1.0 or less".
I know this should be feasible using shift() and rolling joins but would very welcome some help!
# Sample table structure
dt <- data.table(
id = rep(1:303453, each=25*12),
month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), 303453),
spei = runif(303453*25*12, -7, 7))
# A minimal example with 1 location over 12 months
library(data.table)
library(xts)
dt <- data.table(
id = rep("loc1", each=12),
month = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), "month"),
spei = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))
spei.ts <- xts(dt$spei, order.by=dt$month, frequency="month")
plot(spei.ts, type="bars")
This shows 3 drought events over a 1-year period. This is what I need to identify and count.
Hoping some of you are more used to working with time series.
Many thanks, --Mel.
Here is a starting point to get the result you want.
Probably experts can suggest improvements in speed.
EDIT: improved speed ~8x by removing paste.
library(data.table)
set.seed(42)
n <- 300 # 303453 will be ~1000 times slower
dt <- data.table(
id = rep(1:n, each=25*12),
month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), n),
spei = runif(n*25*12, -7, 7))
system.time({
  dt[, `:=`(neg = (spei < 0), neg1 = (spei <= -1))]
  dt[, runid := ifelse(neg, rleid(neg), NA)]   # one id per run of consecutive negative spei
  res <- dt[!is.na(runid),
            .(length = .N[any(neg1)],          # NA unless the run reaches spei <= -1
              start = min(month), end = max(month)),
            by = .(id, runid)][!is.na(length)]
})
# user system elapsed
# 0.345 0.000 0.344
# counts of droughts per id:
res[, .(nDroughts = .N), by = id]
# list of droughts per id: (NB: don't include 1st positive value after)
res[, .(droughtN = seq_len(.N), start, end), by = id]
Update based on comment...
If all that was needed was the counts then
# Let 'sp' = starting point of potential drought
# Let 'dv' = drought level validation
# The cumsum just gives unique ids to group by.
dt[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
dt[, dv := min(spei) <= -1, by = .(id, cumsum(sp))]
dt[sp & dv, .N, by = id]
Yet, as stated in the comments, you've already been there, so you've seen how shift can be used. Since you like the idea of identifying the dates as well, why not use shift there too?
# Extending the previous columns...
dt[, ep := (shift(spei, type = "lead", fill = 1) > 0) & (spei <= 0), by = id]
cbind(dt[sp & dv, .(start = month), by = id],
dt[ep & dv, .(end = month), by = id][,id := NULL])
If you want the dates to be as indicated by the red lines in the plot, just add a month unless it's the last one. We can get the lengths too...
# Extending the previous columns again...
dt[, end.month := shift(month, type = "lead", fill = month[.N]), by = id]
dt[, orig.id := .I]
starts <- dt[sp & dv][, did := .I]
ends <- dt[ep & dv][, did := .I]
starts[ends, on = "did"][
,.(id = id, length = 1 + i.orig.id - orig.id, start = month, end = i.end.month)]
Would yield
id length start end
1: loc1 3 2014-01-01 2014-04-01
2: loc1 1 2014-05-01 2014-06-01
3: loc1 2 2014-07-01 2014-09-01
And it is still fast! With n=300
> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+ thellcounts = thell.counts(copy(dt)),
+ thell .... [TRUNCATED]
Unit: milliseconds
expr min lq mean median uq max neval
max 218.19152 220.30895 342.18605 222.75507 250.36644 1350.15847 10
thellcounts 20.36785 22.27349 28.45167 23.39313 24.38610 78.25046 10
thelldates 28.24378 28.64849 30.59897 30.57793 31.25352 34.51569 10
thelldates2 36.19724 39.79588 42.34457 41.52455 42.41872 57.28073 10
With n=3000
> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+ thellcounts = thell.counts(copy(dt)),
+ thell .... [TRUNCATED]
Unit: milliseconds
expr min lq mean median uq max neval
max 2126.1138 2148.3453 2207.7801 2205.3536 2241.2848 2340.1203 10
thellcounts 197.7312 202.4817 234.2949 205.4828 304.1556 309.1028 10
thelldates 261.9889 264.5597 283.9970 266.1244 267.8603 374.6406 10
thelldates2 320.6352 331.7558 374.4110 340.2668 439.1490 441.8473 10
I want to calculate the mean and standard deviation, by group, for each column in a subset of a large data frame.
I'm trying to understand why some of the answers to similar questions aren't working for me; I'm still pretty new at R and I'm sure there are a lot of subtleties (and not-so-subtle things!) I'm completely missing.
I have a large data frame similar to this one:
mydata <- data.frame(Experiment = rep(c("E1", "E2", "E3", "E4"), each = 9),
Treatment = c(rep(c("A", "B", "C"), each = 3), rep(c("A", "C", "D"), each = 3), rep(c("A", "D", "E"), each = 3), rep(c("A", "B", "D"), each = 3)),
Day1 = sample(1:100, 36),
Day2 = sample(1:100, 36),
Day3 = sample(1:150, 36),
Day4 = sample(50:150, 36))
I need to subset the data by Experiment and by Treatment, for example:
testB <- mydata[(mydata[, "Experiment"] %in% c("E1", "E4"))
& mydata[, "Treatment"] %in% c("A", "B"),
c("Treatment", "Day1", "Day2", "Day4")]
Then, for each column in testB, I want to calculate the mean and standard deviation for each Treatment group.
I started by trying to use tapply (over just one column to begin with), but I get back NA for Treatment groups that shouldn't be in testB, which isn't a big problem with this small dataset, but is pretty irksome with my real data:
>tapply(testB$Day1, testB$Treatment, mean)
A B C D E
70.66667 61.00000 NA NA NA
I tried implementing solutions from Compute mean and standard deviation by group for multiple variables in a data.frame. Using aggregate worked:
ag <- aggregate(. ~ Treatment, testB, function(x) c(mean = mean(x), sd = sd(x)))
But I can't get the data.table solutions to work.
library(data.table)
testB[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by = Treatment]
testB[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = Treatment]
both gave me the error message
Error in `[.data.frame`(testB, , c(mean = lapply(.SD, mean), sd = lapply(.SD, :
unused argument(s) (by = Treatment)
What am I doing wrong?
Thanks in advance for helping a clueless beginner!
Your Treatment column is a factor. Although you've dropped the rows that have the treatments "C", "D", and "E" in your subset testB, those levels still exist; use levels(testB$Treatment) to see them. You can use the droplevels function when defining your testB subset to get means for A and B without returning NAs for the empty factor levels.
testB <- droplevels(mydata[(mydata[, "Experiment"] %in% c("E1", "E4"))
                           & mydata[, "Treatment"] %in% c("A", "B"),
                           c("Treatment", "Day1", "Day2", "Day4")])
tapply(testB$Day1,testB$Treatment,mean)
A B
59.16667 66.00000
Hope this helps!
Ron
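As for the data.table error in the question: testB is still a plain data.frame, so the call dispatches to `[.data.frame`, which has no by argument; that is what the "unused argument(s) (by = Treatment)" message is complaining about. A minimal sketch of the fix, converting first and then reusing the question's own expression:

library(data.table)

setDT(testB)   # or testB <- as.data.table(testB)
testB[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = Treatment]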
You could use plyr and reshape2 to tackle this problem as well; I generally prefer to use these libraries because the abstractions they introduce apply to more problems, and are cleaner.
How I would solve it:
library(plyr)
library(reshape2)

# testB from your code above
# make a "long" version of testB
longTestB <- melt(testB, id.vars = "Treatment")

# then use ddply to calculate your metrics for each Treatment and Day column
ddply(longTestB, .(Treatment, variable), summarize, mean = mean(value), stdev = sd(value))