I have just started using R and wrote the following code but this is taking about 40 mins to process, so I am sure this can be coded in a way that it runs faster.
Basically, I have one large data set (crsp_td_net) of about 7 GB in size and a second, smaller data set (ff_35f). Both contain trading dates. What I am trying to do is fill in the trading dates for each company in the first data set.
From my first data set, I create subsets of data based on a company index and then merge each subset with the second data set based on trading dates. Each merged subset is appended to the previous companies' data, and so on, until at the end I am left with a large data set containing all the initial companies but with the missing trading days incorporated.
I am not sure at this point whether the fact that the data frame final keeps expanding is causing my loop to run slowly, or whether the loop itself is coded inefficiently. I understand that vectorisation can help speed this up, but I am not sure how to do that here (the size of the subset changes for each company). I am also not sure of the best way to use apply, sapply or lapply (if any of these can be used here). I have browsed a few related questions on R but have not found a way to go about this. I would very much appreciate an alternative snippet of code that makes the below run faster.
todo <- matrix(numeric(0), 0, 4)
for (i in 1:7396) {
  final <- crsp_td_net %>%
    filter(compid == i) %>%
    merge(ff_35f, by = "date_crsp", all = TRUE)
  final <- final %>%
    filter(between(date_crsp,
                   as.Date(min(date_crsp_orig, na.rm = TRUE)),
                   as.Date(max(date_crsp_orig, na.rm = TRUE)))) %>%
    arrange(date_crsp) %>%
    mutate(cusip8dg_compustat = ifelse(is.na(cusip8dg_compustat),
                                       max(cusip8dg_compustat, na.rm = TRUE),
                                       cusip8dg_compustat)) %>%
    mutate(compid = ifelse(is.na(compid), i, compid)) %>%
    select(compid, cusip8dg_compustat, date_crsp, date_crsp_orig) %>%
    distinct()
  todo <- bind_rows(todo, final)
}
Thanks in advance,
Dev
Thank you all for your responses. I was unable to reply in the comment box due to the length limit, so I am adding to my original post. #P Lapointe - please find a reproducible data set below (I have used integer values instead of actual dates). #eipi10 - I think you have understood what I am after, and thanks for the code, but I am not sure if it is missing something, as it prompts for an input (I have all relevant libraries loaded). #Alistaire - I will indeed be facing memory problems as I perform more calculations to add to the original data set. I would be grateful for suggestions on how to make the loop faster, or for an alternative to it; it would be very helpful to see how they would be implemented in the example below.
many thanks
zz <- "compid date_crsp
1 1 2
2 1 3
3 1 5
4 2 3
5 2 7
6 2 9
7 3 3
8 3 5
9 3 7
10 3 8"
crsp_td_net <- read.table(text=zz, header = TRUE)
xx <- "date_crsp
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11"
ff_35f <- read.table(text=xx, header = TRUE)
# I expect my final output to look like this:
yy<-"compid date_crsp
1 1 2
2 1 3
3 1 4
4 1 5
5 2 3
6 2 4
7 2 5
8 2 6
9 2 7
10 2 8
11 2 9
12 3 3
13 3 4
14 3 5
15 3 6
16 3 7
17 3 8"
output_wanted<-read.table(text=yy, header = TRUE)
df <- full_join(crsp_td_net, expand.grid(compid = unique(crsp_td_net$compid), date_crsp=unique(ff_35f$date_crsp)))
todo <- array(numeric(), c(1, 4, 0))
todo <- matrix(numeric(0), 0, 0)
for (i in 1:3) {
  final <- filter(crsp_td_net, compid == i)
  final <- mutate(final, date_crsp_orig = date_crsp)
  final <- merge(final, ff_35f, by = "date_crsp", all = TRUE)
  final <- filter(final, between(date_crsp,
                                 min(date_crsp_orig, na.rm = TRUE),
                                 max(date_crsp_orig, na.rm = TRUE)))
  final <- arrange(final, date_crsp)
  final <- mutate(final, compid = ifelse(is.na(compid), i, compid))
  final <- select(final, compid, date_crsp)
  final <- distinct(final)
  todo <- bind_rows(todo, final)
}
I have modified the full_join example and it now runs, but it is not doing what I want: merging each compid with the unique trading days so as to fill in the missing trading days in the first data set. I would very much appreciate any suggestion on this.
The loop I wrote above gives me exactly what I want, but it takes about 40 minutes to run because I have to loop over 7,000 or so compids to create the large data set todo, so I wonder if there is a faster way to write this loop or an alternative to it.
Many thanks in advance
crsp_td_net$date_crsp_orig <- crsp_td_net$date_crsp
df <- full_join(crsp_td_net,
                expand.grid(compid = unique(crsp_td_net$compid),
                            date_crsp = unique(ff_35f$date_crsp)),
                by = "date_crsp")
df <- df %>%
  filter(between(date_crsp, min(date_crsp_orig, na.rm = TRUE), max(date_crsp_orig, na.rm = TRUE)))
df <- df %>%
  filter(!is.na(compid.x)) %>%
  select(-compid.y) %>%
  distinct() %>%
  arrange(compid.x, date_crsp)
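For illustration only, here is a minimal sketch (against the toy data above) of how the expand.grid/full_join idea could be made to work: join on both compid and date_crsp, and apply the between() filter per company by grouping on compid rather than using the global minimum and maximum. The object name df_fixed is made up for this sketch; it is not the approach taken in the answers that follow.
# Sketch: per-company completion via a full join on (compid, date_crsp).
# Assumes dplyr is loaded and reuses the toy crsp_td_net / ff_35f defined above.
df_fixed <- crsp_td_net %>%
  mutate(date_crsp_orig = date_crsp) %>%
  full_join(expand.grid(compid = unique(crsp_td_net$compid),
                        date_crsp = unique(ff_35f$date_crsp)),
            by = c("compid", "date_crsp")) %>%
  group_by(compid) %>%
  filter(between(date_crsp,
                 min(date_crsp_orig, na.rm = TRUE),
                 max(date_crsp_orig, na.rm = TRUE))) %>%
  ungroup() %>%
  select(compid, date_crsp) %>%
  distinct() %>%
  arrange(compid, date_crsp)
# On the toy data, df_fixed matches output_wanted row for row.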
Although the OP has asked for a dplyr solution, I can only suggest a solution which uses the foverlaps() function from the data.table package.
The OP has requested to complete the trading dates for each company in crsp_td_net by trading dates given in ff_35f. Completion means to fill up date ranges from a start date to an end date with given dates. (Note that the OP is using integer values in place of dates). The given dates can be considered to be date ranges as well where each range consists only of one day.
Now the problem has been rephrased as finding the overlaps of two sequences of (date) ranges (an overlap join). For this, the foverlaps() function can be used; it is inspired by the findOverlaps() function of Bioconductor's IRanges package but works with non-genomic (i.e., non-integer) ranges as well.
library(data.table)
# coerce to data.table
setDT(crsp_td_net)
setDT(ff_35f)
# find start and end date for each company
comp_date_range <- crsp_td_net[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid]
# turn given dates into date ranges of one day lengths
# by adding an end column equal to the start dates
ff_35f[, end := date_crsp]
# set keys
setkey(comp_date_range, start, end)
setkey(ff_35f, date_crsp, end)
# find all overlapping ranges
temp <- foverlaps(comp_date_range, ff_35f)
# reorder result for convenience and pick desired columns
result <- temp[order(compid, date_crsp), .(compid, date_crsp)]
The result is in line with the expected output:
result
# compid date_crsp
# 1: 1 2
# 2: 1 3
# 3: 1 4
# 4: 1 5
# 5: 2 3
# 6: 2 4
# 7: 2 5
# 8: 2 6
# 9: 2 7
#10: 2 8
#11: 2 9
#12: 3 3
#13: 3 4
#14: 3 5
#15: 3 6
#16: 3 7
#17: 3 8
This can be written more concisely in a single line of code:
foverlaps(
setkey(setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid], start, end),
setkey(setDT(ff_35f)[, .(date_crsp, end = date_crsp)], date_crsp, end)
)[order(compid, start), .(compid, date_crsp)]
Note: The OP has replaced dates by integers in the question. The data.table package offers date and time classes with integer storage for fast sorting and grouping (see ?as.IDate).
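For example, if the real columns held Date or character values rather than integers, a conversion along these lines (just a sketch) would give integer-backed dates:
# Sketch: convert Date/character date columns to IDate (integer storage).
library(data.table)
setDT(crsp_td_net)[, date_crsp := as.IDate(date_crsp)]
setDT(ff_35f)[, date_crsp := as.IDate(date_crsp)]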
Giving this problem a second thought, I believe it can be solved at reasonable speed using data.table's non-equi join. (I'm posting this as a separate answer because the approach is quite different from foverlaps().)
library(data.table)
# coerce to data.table
setDT(crsp_td_net)
setDT(ff_35f)
# find start and end date for each company
comp_date_range <- crsp_td_net[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid]
# non equi join: the result contains only rows which fulfill the condition in on = ...
# by = .EACHI executes .SD for each group, returning matching rows for each date
# nomatch = 0 (inner join) skips dates without matching company
temp <- comp_date_range[ff_35f, on = c("start<=date_crsp", "end>=date_crsp"),
.SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE]
# reorder result for convenience and pick desired columns
result <- temp[order(compid, start), .(compid, date_crsp = start)]
The result is in line with the expected output:
result
# compid date_crsp
# 1: 1 2
# 2: 1 3
# 3: 1 4
# 4: 1 5
# 5: 2 3
# 6: 2 4
# 7: 2 5
# 8: 2 6
# 9: 2 7
#10: 2 8
#11: 2 9
#12: 3 3
#13: 3 4
#14: 3 5
#15: 3 6
#16: 3 7
#17: 3 8
Note that there is an implicit assumption that the range of dates given in ff_35f covers the whole range of dates used in crsp_td_net. Otherwise, company trading days would drop out of the result.
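A quick sanity check along these lines (a sketch, assuming both objects are the tables used above) can guard against that:
# Sketch: confirm ff_35f spans the full range of trading dates in crsp_td_net;
# if it does not, extend ff_35f before joining so no company rows are lost.
stopifnot(min(ff_35f$date_crsp) <= min(crsp_td_net$date_crsp),
          max(ff_35f$date_crsp) >= max(crsp_td_net$date_crsp))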
Benchmark results
At the time of writing, three different solutions had been posted. The OP measured the elapsed times of all three solutions on his 7 GB data set and reported:
1.12 seconds for the foverlaps() solution
1.41 seconds for the non-equi join solution
11.92 seconds for the tidyr/ dplyr solution
in comments here and here.
I was quite surprised to find the foverlaps() solution to be faster than the non-equi join, so I ran some benchmarks with varying problem sizes using the microbenchmark package.
The problem size is given by the number of companies. For each company, the trading days are randomly sampled from a selection of 260 "dates" simulating one year without weekends (For details see code below). The data set contains about 130 rows per company on average.
As can be seen from the chart of my own benchmarks (note that both axes are on a log scale), foverlaps() is somewhat faster than the non-equi join for larger problem sizes, while the non-equi join is the fastest method for smaller problem sizes. tidyr/dplyr is almost always the slowest method and an order of magnitude slower on large problems.
Define function for benchmark runs of problem size n_comp
bm_run <- function(n_comp) {
# define 1 year of trading dates, simulating weekends
ff_35f <- sort(outer(1:5, 7*(0:51), `+`))
# create tradings dates for each company
crsp_td_net <- rbindlist(lapply(seq_len(n_comp), function(i) {
# how many trading dates to sample for actual company?
n_days <- sample(length(ff_35f), 1)
# sample trading dates
data.frame(compid = i,
date_crsp = sort(sample(ff_35f, n_days)))
}))
# coerce to data.frame
setDF(crsp_td_net)
# turn vector of trading dates into data.frame
ff_35f <- data.frame(date_crsp = ff_35f)
# scale down number of repetitions with problem size
n_times <- as.integer(scales::squish(1000*1000 / nrow(crsp_td_net), c(3, 1000)))
print(sprintf("%i companies with a total of %i trading dates, %i runs",
n_comp, nrow(crsp_td_net), n_times))
# do the benchmark runs for this problem size
mb <- microbenchmark::microbenchmark(
foverlaps = {
foverlaps(
setkey(setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid], start, end),
setkey(setDT(ff_35f)[, .(date_crsp, end = date_crsp)], date_crsp, end)
)[order(compid, start), .(compid, date_crsp)]
},
non_equi_join = {
setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)), by = compid
][setDT(ff_35f), on = c("start<=date_crsp", "end>=date_crsp"),
.SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE
][order(compid, start), .(compid, date_crsp = start)]
},
dplyr = {
setDF(crsp_td_net)
setDF(ff_35f)
crsp_td_net %>%
dplyr::group_by(compid) %>%
dplyr::summarize(date_crsp = list(seq(from=min(date_crsp), to=max(date_crsp), by=1))) %>%
tidyr::unnest() %>%
dplyr::semi_join(ff_35f, by="date_crsp") %>%
dplyr::arrange(compid, date_crsp)
},
times = n_times
)
# return problem size and timings as list
return(list(n_comp, nrow(crsp_td_net), mb))
}
Run benchmark for different problem sizes
library(data.table)
library(magrittr)
# number of companies
n_comp <- outer(c(1,2), 10^(1:4), `*`)
# set seed of RNG for creation of reproducible data
set.seed(1234)
# do benchmark runs with different problem size derived from number of companies
bm <- lapply(n_comp, bm_run)
Prepare data for plotting
# create data.table with benchmark timinings from chunks in returned list
mbl <- rbindlist(lapply(bm, `[[`, i = 3), id = "n_row")
# aggregate results
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
# reorder factor levels
mba[, expr := forcats::fct_reorder(expr, -median_time)]
# replace chunk number by number of rows
mba[, n_row := unlist(lapply(bm, `[[`, i = 2))[n_row]]
Create chart
library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +
geom_point() + geom_smooth(se = FALSE) +
scale_x_log10(breaks = unique(mba$n_row), labels = scales::comma) +
scale_y_log10() +
xlab("number of rows") + ylab("median of execution time [ms]") +
ggtitle("microbenchmark results") + theme_bw()
I adapted your data to use actual dates. In the data, 2017-01-04 and 2017-01-06 are not in the date table. This approach generates a sequence from each company's first to last date; on compid 2, the filling in of the missing dates can be seen. seq.Date(from=, to=, by=1) generates the missing dates.
The unnest probably creates a large data frame, so there is some risk of running out of memory, but if you keep these tables down to just compid and date_crsp then it may fit.
semi_join and inner_join should both work - you will want to test both for speed (a sketch of the inner_join variant follows the output below).
zz <- "compid date_crsp
1 1 2017-01-02
2 1 2017-01-03
3 1 2017-01-05
4 2 2017-01-03
5 2 2017-01-07
6 2 2017-01-09
7 3 2017-01-03
8 3 2017-01-05
9 3 2017-01-07
10 3 2017-01-08"
crsp_td_net <- read.table(text=zz, header = TRUE)
library(lubridate)
crsp_td_net$date_crsp <- ymd(crsp_td_net$date_crsp)
xx <- "date_crsp
1 2017-01-02
2 2017-01-03
3 2017-01-05
4 2017-01-07
5 2017-01-08
6 2017-01-09
7 2017-01-10"
ff_35f <- read.table(text=xx, header = TRUE)
ff_35f$date_crsp <- ymd(ff_35f$date_crsp)
library(dplyr)
library(tidyr)
crsp_td_net_summary <- crsp_td_net %>%
group_by(compid) %>%
summarize(date_crsp = list(seq.Date(from=min(date_crsp), to=max(date_crsp), by=1))) %>%
unnest() %>%
semi_join(ff_35f, by="date_crsp") %>%
arrange(compid, date_crsp)
crsp_td_net_summary
# # A tibble: 12 × 2
# compid date_crsp
# <int> <date>
# 1 1 2017-01-02
# 2 1 2017-01-03
# 3 1 2017-01-05
# 4 2 2017-01-03
# 5 2 2017-01-05
# 6 2 2017-01-07
# 7 2 2017-01-08
# 8 2 2017-01-09
# 9 3 2017-01-03
# 10 3 2017-01-05
# 11 3 2017-01-07
# 12 3 2017-01-08
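As mentioned above, inner_join can be used instead of semi_join; a sketch of that variant, reusing the objects defined above, is below. Because ff_35f contains only the join key (unique dates, no extra columns), the two should return the same rows here, but timing both on the full data is worthwhile.
# Sketch: same pipeline with inner_join in place of semi_join.
crsp_td_net %>%
  group_by(compid) %>%
  summarize(date_crsp = list(seq.Date(from = min(date_crsp), to = max(date_crsp), by = 1))) %>%
  unnest() %>%
  inner_join(ff_35f, by = "date_crsp") %>%
  arrange(compid, date_crsp)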
Related
This is my first question here in a long time :).
I've got a data frame with data about patient visits to a clinic.
visit_id <- c(1,2,3,4,5,6,7,8,9,10)
patient_id <- c(1,2,1,1,3,2,1,4,5,6)
visit_date <- as.Date(c('2016-12-02','2016-12-02','2016-12-30',
'2016-12-15','2016-12-30','2017-02-01',
'2017-02-15','2017-02-10','2017-01-15','2017-03-01'))
df <- data.frame(visit_id, patient_id, visit_date)
It looks like this:
visit_id patient_id visit_date
1 1 1 2016-12-02
2 2 2 2016-12-02
3 3 1 2016-12-30
4 4 1 2016-12-15
5 5 3 2016-12-30
6 6 2 2017-02-01
7 7 1 2017-02-15
8 8 4 2017-02-10
9 9 5 2017-01-15
10 10 6 2017-03-01
I want to add one more column that would indicate the number of times the patient has been to the clinic in the last 20 weeks:
visit_id patient_id visit_date visits_previous_20_weeks
1 1 1 2016-12-02 0
2 2 2 2016-12-02 0
3 3 1 2016-12-30 2
4 4 1 2016-12-15 1
5 5 3 2016-12-30 0
6 6 2 2017-02-01 1
7 7 1 2017-02-15 3
8 8 4 2017-02-10 0
9 9 5 2017-01-15 0
10 10 6 2017-03-01 0
The only data source is this table. So at the beginning of the table, since this is the first record, patient 1 has been to the clinic 0 times. But on December 15th, 2016, the patient comes back to the clinic, so the number of visits in the previous 20 weeks (as of that date) is 1.
One inefficient way to do this would be to write a loop that, for each row in the data frame, goes through the whole data frame and tallies the number of visits by the same patient in the previous 20 weeks (sketched below). Is there a better way to do this in R?
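To make that baseline concrete, a rough sketch of the brute-force version might look like the following (illustration only; the name visits_naive is made up, and the answers below avoid scanning the whole frame for every row):
# Sketch: for each row, count earlier visits by the same patient
# within the previous 20 weeks (140 days).
visits_naive <- sapply(seq_len(nrow(df)), function(i) {
  sum(df$patient_id == df$patient_id[i] &
        df$visit_date < df$visit_date[i] &
        df$visit_date > df$visit_date[i] - 20 * 7)
})
visits_naive
# [1] 0 0 2 1 0 1 3 0 0 0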
Thanks :)
Here's a way using the data.table package. What this is basically doing is first creating a 20-week boundary column and then performing a non-equi self join while counting the matches.
library(data.table)
setDT(df)[, visit_date := as.IDate(visit_date)] # Convert visit_date to a proper Date class
df[, visit_date20 := visit_date - 20*7] # Create a 20-week boundary
## Count previous visits within the range
df[df, .(Visits = .N),
on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
by = .EACHI]
# patient_id visit_date visit_date Visits
# 1: 1 2016-12-02 2016-07-15 0
# 2: 2 2016-12-02 2016-07-15 0
# 3: 1 2016-12-30 2016-08-12 2
# 4: 1 2016-12-15 2016-07-28 1
# 5: 3 2016-12-30 2016-08-12 0
# 6: 2 2017-02-01 2016-09-14 1
# 7: 1 2017-02-15 2016-09-28 3
# 8: 4 2017-02-10 2016-09-23 0
# 9: 5 2017-01-15 2016-08-28 0
# 10: 6 2017-03-01 2016-10-12 0
If I understood you well, here is a solution using the data.table package. I have found two options (but the first one has better performance)
Convert the original data frame into data.table object:
dt <- data.table(df) # Create a data table from the data frame
setorder(dt, patient_id, visit_date) # Sort by patient_id, then by visit_date
Define the week threshold parameter:
weekNum = 20L # Considering a threshold of: 20-weeks.
OPTION 1: Computing the number of weeks directly from the visit_date column
We define the following function that makes the calculation for each group:
visitFreq <- function(x) {
n <- length(x)
result <- numeric(n)
if (n > 1) {
for (i in 1:n) {
# For each row of the column by patient_id
ref <- x[i] # reference date
x.prev <- x[x < ref] # select previous dates
if (length(x.prev) > 0) {
x.prev <- sapply(x.prev, function(y) {
ifelse(difftime(ref, y, units = "weeks") <= weekNum, 1, 0)
})
result[i] <- sum(x.prev)
}
}
}
return(result)
}
For each x[i] it selects the previous visit dates and then checks whether each of them lies within the defined threshold. What is left is simply to count the number of previous visits within the threshold.
Once we know how to make the calculation, we just need to apply this function to the visit_date column for each patient_id:
dt[, visits := visitFreq(visit_date), by = patient_id]
Note: The function visitFreq has to be written as a vectorised function that receives a vector of visit_date values and returns a vector of the same length.
OPTION 2: Creating an artificial variable that collects all visit_date for a given patient.
Now we need to create a function that makes the calculation for computing the number of weeks:
calc <- function(vec, x) {
vec.prev <- vec[vec < x] # Select all dates before x
n <- 0
if (length(vec.prev) > 0) {
vec.prev <- sapply(vec.prev, function(y) {
ifelse(difftime(x, y, units = "weeks") <= weekNum, 1, 0)
})
n <- sum(vec.prev)
}
return(n)
}
where:
vec: a vector of dates
x: the reference date
We keep only the dates prior to the reference date x. We then apply sapply to each element of vec to compute the time difference between y (each element of vec) and the reference date x, using weeks as the unit. The result is 1 for any difference of at most weekNum weeks and 0 otherwise, so the number of previous visits within the threshold is simply the sum of the 1s.
Now we use this function in a data.table object like this:
dt[, visits := .(list(visit_date)), by = patient_id][
   , visits := mapply(calc, visits, visit_date)][order(patient_id)][]
Let's explain it a little bit:
We create a visits variable that is a list of all dates for a given patient_id (because of the by clause).
If we execute the first expression it will produce something like this:
> dt[, visits := .(list(visit_date)), by = patient_id][]
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 2016-12-02,2016-12-15,2016-12-30,2017-02-15
2: 4 1 2016-12-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
3: 3 1 2016-12-30 2016-12-02,2016-12-15,2016-12-30,2017-02-15
4: 7 1 2017-02-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
5: 2 2 2016-12-02 2016-12-02,2017-02-01
6: 6 2 2017-02-01 2016-12-02,2017-02-01
7: 5 3 2016-12-30 2016-12-30
8: 8 4 2017-02-10 2017-02-10
9: 9 5 2017-01-15 2017-01-15
10: 10 6 2017-03-01 2017-03-01
>
The second statement (the second []-block) just does the calculation, re-assigning the previously created variable visits but now counting the number of previous visits with respect to the reference date. We need mapply to make the computation element-wise: on each invocation of the calc function the input arguments are dt$visits[[i]] (a list of dates) and the corresponding dt$visit_date[i]. mapply just iterates over all i elements, invoking the function calc.
RESULT
Finally, the result will be:
> dt
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 0
2: 4 1 2016-12-15 1
3: 3 1 2016-12-30 2
4: 7 1 2017-02-15 3
5: 2 2 2016-12-02 0
6: 6 2 2017-02-01 1
7: 5 3 2016-12-30 0
8: 8 4 2017-02-10 0
9: 9 5 2017-01-15 0
10: 10 6 2017-03-01 0
>
and I guess this is what you wanted.
Note: There is probably a way to do the calculation on the fly, but I was not able to see how. Perhaps other folks can suggest a syntactically more succinct way.
PERFORMANCE
I was wondering which option has better performance (I expected OP1 to win), so let's check it:
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
times=100L)
print(op)
It produces the following output:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 3.467451 3.552916 4.165517 3.642150 4.200413 7.96348 100 a
OP2 4.732729 4.832695 5.799648 5.063985 6.073467 13.17264 100 b
>
Therefore the first option has the better performance.
EDIT (added the solution proposed by: #DavidArenburg)
Let's include the join solution as a third option, but let's increase the size of the input by repeating the input vectors, for example:
nSample <- 100
patient_id <- rep(c(1, 2, 1, 1, 3, 2, 1, 4, 5, 6), nSample)
visit_id <- 1:(10 * nSample)
visit_date <- rep(as.Date(c('2016-12-02', '2016-12-02', '2016-12-30',
'2016-12-15', '2016-12-30', '2017-02-01',
'2017-02-15', '2017-02-10', '2017-01-15', '2017-03-01')), nSample)
df <- data.frame(visit_id, patient_id, visit_date)
opc3 <- function(df) {
df[, visit_date20 := visit_date - 20 * 7] # Create a 20-week boundary
## Count previous visits within the range
df[df, .(visits = .N),
on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
by = .EACHI]
}
dt <- data.table(df)
dt3 <- copy(dt)[, visit_date := as.IDate(visit_date)] # Convert visit_date to a proper Date class
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
OP3 = opc3(copy(dt3)),
times = 10L)
print(op)
I get the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 6315.73724 6485.111937 10744.808669 11789.230998 15062.957734 15691.445961 10 b
OP2 6266.80130 6431.330087 11074.441187 11773.459887 13928.861934 15335.733525 10 b
OP3 2.38427 2.845334 5.157246 5.383949 6.711482 8.596792 10 a
>
The #DavidArenburg solution scales much better as the number of rows increases.
How about this solution, using dplyr and lubridate?
library(lubridate)
no_of_weeks <- 4 #You want 20 here, but the result will be NULL for the example dataset you've given
df %>%
mutate(week_filter=visit_date<Sys.Date()-weeks(no_of_weeks)) %>%
group_by(patient_id) %>%
mutate(visits_previous_n_weeks=cumsum(week_filter)) %>%
ungroup()
I'm using a data.table in R to store a time series. I want to return a subset such that successive rows for the selected times are at least N seconds apart from the last row that was selected, e.g. if I have
library(data.table)
x <- data.table(t=c(0,1,3,4,5,6,7,10,16,17,18,20,21), v=1:13)
x
t v
1: 0 1
2: 1 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 10 8
9: 16 9
10: 17 10
11: 18 11
12: 20 12
13: 21 13
and I want to sample rows that are at least 5 seconds apart, starting from the first row, then I should get a data.table with time/value pairs:
y <- x[...something...]
y
t v
1: 0 1
2: 5 5
3: 10 8
4: 16 9
5: 21 13
The time samples don't have to be regularly spaced either, so I can't just take every M-th row. Of course I could do this by looping through the data.table rows manually (roughly as sketched below), but I'm wondering if there's a more convenient way to express this using data.table's indexing.
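For reference, the manual loop alluded to above could look roughly like this (illustration only; the rolling-join answer below avoids explicit row iteration):
# Sketch: walk the rows, keeping a row whenever at least 5 seconds have
# passed since the last kept row.
keep <- 1L
last_t <- x$t[1]
for (i in 2:nrow(x)) {
  if (x$t[i] - last_t >= 5) {
    keep <- c(keep, i)
    last_t <- x$t[i]
  }
}
x[keep]   # rows with t = 0, 5, 10, 16, 21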
Here are a couple ways to use rolling joins to find the set of rows, w, in your subset:
t_plus = 5
# one join per row visited
w <- c()
nxt <- 1L
while(!is.na(nxt)){
w <- c(w, nxt)
nxt <- x[.(t[nxt]+t_plus), on=.(t), roll=-Inf, which=TRUE]
}
# join once on all rows
w0 <- x[.(t+5), on=.(t), roll=-Inf, which=TRUE]
w <- c()
nxt <- 1L
while (!is.na(nxt)){
w <- c(w, nxt)
nxt <- w0[nxt]
}
Then you can subset like x[w].
Comments
In principle, there could be other subsets that satisfy the OP's condition "at least 5 seconds apart"; this is just the one found by iterating from the first row forward.
The second way is based on #DavidArenburg's answer to the Q&A Henrik linked above. Although the question seems the same, I couldn't get that approach to work fully here.
Generally, it's a bad idea to grow things in a loop in R (like I'm doing with w here). If you're running into performance problems, that might be a good area to improve in this code.
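One possible mitigation, sketched under the same setup as the first approach above, is to preallocate w to its maximum possible length and trim it afterwards:
# Sketch: preallocate instead of growing w with c() inside the loop.
w <- integer(nrow(x))
k <- 0L
nxt <- 1L
while (!is.na(nxt)) {
  k <- k + 1L
  w[k] <- nxt
  nxt <- x[.(t[nxt] + t_plus), on = .(t), roll = -Inf, which = TRUE]
}
w <- w[seq_len(k)]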
I need to put a number on the first (or a random) item in each group.
I do the following:
item<-sample(c("a","b", "c"), 30,replace=T)
week<-rep(c("1","2","3"),10)
volume<-c(1:30)
DT<-data.table(item, week,volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[,.SD[1], by= list(item,week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
The sampleDT comes out correct, but the last line puts 5 on all rows instead of only the intended ones.
What am I doing wrong?
I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
Also there is a pending FR to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]
The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
A simpler solution (I think) would be using !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, by = c("item", "week")), newCol := 5]
Joining two data.table I can specify the table I want the column from, like
X[Y, i.id] # `id` is taken from Y
My problem is that I have a big table with ~80 columns. Every night a data refresh happens and, according to some parameters, some rows get replaced by a new version of the table (same table, just new data).
current <- data.table(id=1:4, var=1:4, var2=1:4, key="id")
new <- data.table(id=1:4, var=11:14, var2=11:14, key="id")
current[new[c(1,3)], `:=`(var=i.var, var2=i.var2)]
> current
id var var2
1: 1 11 11
2: 2 2 2
3: 3 13 13
4: 4 4 4
As I said, in my real case I have many more columns, so (besides rbind()ing pieces of the two tables) I wonder how I can select all the columns of the data.table used as the i argument in a join. I could spend half an hour hard-coding all of them, but that wouldn't be maintainable code (in case new columns get added to the tables in the future).
How about constructing the j-expression and just eval'ing it?
nc = names(current)[-1L]
nn = paste0("i.", nc)
expr = lapply(nn, as.name)
setattr(expr, 'names', nc)
expr = as.call(c(quote(`:=`), expr))
> current[new[c(1,3)], eval(expr)]
> current
## id var var2
## 1: 1 11 11
## 2: 2 2 2
## 3: 3 13 13
## 4: 4 4 4
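On reasonably recent data.table versions, a shorter variant that should achieve the same assignment (offered only as a sketch, not tested here) is to build the i.-prefixed names and assign them with mget():
# Sketch: copy every non-key column from the i table in a single assignment.
nc <- names(current)[-1L]                       # all columns except the key "id"
current[new[c(1, 3)], (nc) := mget(paste0("i.", nc))]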
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
            .(group1,group2),
            choice,
            label="test")
Note that in this case, I am also adding an extra column to the data frame called "label" which is specified as an extra argument to the ddply function. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient as I have to write a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
EDIT 2: Performance comparison: data.table is ~15x faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) { x[sample(nrow(x), 1), ] }))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample withing groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]