Conditional accumulating sum with a dynamic condition - R

Good afternoon,
I'm trying to create an accumulating mean with a "twist": I only want to average observations dated before the current one (there may be observations sharing the same date).
I succeeded doing it the "dirty" way using several custom functions, but it takes too long and is very inefficient; I'm pretty sure there is a better way.
I was thinking about something along the lines of:
averages <- DB %>% group_by(field1, field2) %>% mutate(Avg = cummean(??? * value1))
How do I access the current observation inside the cummean call?
The route I went was to create a logical vector for every subset with a loop:
for (i in 1:(length(datevector) - 1))
  logicalvector[i] <- datevector[length(datevector)] > datevector[i]
logicalvector[length(datevector)] <- FALSE
and use that in another function to calculate the mean.
A simple example:
df <- data.frame(id=1:5,Date=as.Date(c("2013-08-02","2013-08-02","2013-08-03","2013-08-03","2013-08-04")),Value=c(1,4,5,2,4))
id       Date Value accum_mean
 1 2013-08-02     1        0.0
 2 2013-08-02     4        0.0
 3 2013-08-03     5        2.5
 4 2013-08-03     2        2.5
 5 2013-08-04     4        3.0
Explanation:
There are no observations with a prior date for the first two rows, so the mean is 0.
The 3rd observation averages the 1st and 2nd, i.e. (1 + 4) / 2 = 2.5; so does the 4th.
The 5th observation averages all four earlier values: (1 + 4 + 5 + 2) / 4 = 3.
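For reference, the target column can be computed with a plain per-row scan (a minimal base-R sketch, assuming df as defined above; slow on large data, but handy for validating the faster answers below):
df$accum_mean <- sapply(df$Date, function(d) {
  prev <- df$Value[df$Date < d]  # strictly earlier dates only
  if (length(prev) > 0) mean(prev) else 0
})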

This can be implemented as a self-join in SQL: join to each row all rows with a lesser Date, and for each row average Value over the joined rows. coalesce is used to assign 0 where the mean would otherwise be NULL, and grouping by a.rowid (SQLite's implicit row identifier) keeps one output row per original row.
library(sqldf)
sqldf("select a.*, coalesce(avg(b.Value), 0) as mean
from df as a
left join df as b on b.Date < a.Date
group by a.rowid")
giving:
id Date Value mean
1 1 2013-08-02 1 0.0
2 2 2013-08-02 4 0.0
3 3 2013-08-03 5 2.5
4 4 2013-08-03 2 2.5
5 5 2013-08-04 4 3.0
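The same self-join can be written with data.table's non-equi joins (a sketch, assuming df from above and data.table >= 1.12.2 for fcoalesce):
library(data.table)
DT <- as.data.table(df)
# for each row of the inner DT, average Value over rows with a strictly smaller Date
prev <- DT[DT, on = .(Date < Date), .(m = mean(Value)), by = .EACHI]
DT[, accum_mean := fcoalesce(prev$m, 0)]  # rows with no earlier date yield NA -> 0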

Using data.table and lubridate you have this option:
library(data.table)
library(lubridate)
dt <- data.table(id=c(1:5))
dt$Date <- c("02/08/2013", "02/08/2013", "03/08/2013", "03/08/2013", "04/08/2013")
dt$Value <- c(1,4,5,2,4)
dt$Date <- dmy(dt$Date)
cummean <- function(d) {
  if (nrow(dt[Date < d]) > 0)
    dt[Date < d, sum(Value) / .N]
  else 0
}
dt[, accuMean:=mapply(cummean,Date)]
# id Date Value accuMean
#1: 1 2013-08-02 1 0.0
#2: 2 2013-08-02 4 0.0
#3: 3 2013-08-03 5 2.5
#4: 4 2013-08-03 2 2.5
#5: 5 2013-08-04 4 3.0
Solution when you have multiple values:
library(data.table)
library(lubridate)
dt <- data.table(id=c(1:5))
dt$Date <- c("02/08/2013", "02/08/2013", "03/08/2013", "03/08/2013", "04/08/2013")
dt$Value_1 <- c(1,4,5,2,4)
dt$Value_2 <- c(3,2,0,1,2)
dt$Value_3 <- c(4,9,3,3,3)
dt$Date <- dmy(dt$Date)
cummean <- function(d, Value) {
  if (nrow(dt[Date < d]) > 0)
    sum(dt[Date < d, Value, with = FALSE]) / dt[Date < d, .N]
  else 0
}
n <- 3
accuMean <- paste0("accuMean_", (1:n))
for (i in 1:n) {
  print(i)
  dt[, (accuMean[i]) := mapply(cummean, Date, MoreArgs = list(paste0("Value_", i)))]
}
Suppose you have n values named Value_i. Then in your case, you only need to set n = 10.
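If you prefer to avoid the explicit loop, the same per-column computation can be written with lapply over .SD (a sketch under the same assumptions; each column is still scanned once per row, so this is a readability change rather than a speed-up):
cols <- paste0("Value_", 1:n)
dt[, paste0("accuMean_", 1:n) := lapply(.SD, function(v)
     sapply(Date, function(d) if (any(Date < d)) mean(v[Date < d]) else 0)),
   .SDcols = cols]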

Related

Replacing missing values in time series data in R

I am new to R. I am hoping to replace the missing values for X in the data. How can I replace the missing values of "X" when "Time" = 1 or 2 with the value of "X" when "Time" = 3, for the same "SubID" and the same "Day"?
SubID: subject number
Day: each subject's day number (1,2,3...21)
Time: morning marked as 1, afternoon marked as 2, and evening marked as 3
X: only has a valid value when Time is 3, others are missing.
SubID Day Time   X
    1   1    1  NA
    1   1    2  NA
    1   1    3 7.4
    1   2    1  NA
    1   2    3 6.2
    2   1    1  NA
    2   1    2  NA
    2   1    3 7.1
    2   2    3 5.9
    2   2    2  NA
    2   2    1  NA
I was able to go as far as the following code using zoo. I have very limited experience in R. Thank you in advance!
data2 <- transform(data1,
                   X = na.aggregate(X, by = SubID, FUN = sum, na.rm = TRUE))
Here's the explanation of my comment:
library(data.table)
library(zoo)
setDT(data1)
data1[order(-Time),
Xf := na.locf(X),
by = .(SubID, Day)]
Ok so the setDT function makes the data1 object a data.table. Then order(-Time) orders data1 with respect to Time in descending order (because of the -). Xf := na.locf(X) creates a new column Xf by reference (which means you don't have to assign the result back to data1) using na.locf(X), a function in the zoo package that fills NAs forward with the previous value (in this case filling Time 2 and 1 with the value at Time 3). The last line specifies that we want to do this grouped by SubID and Day.
Hope it's clearer now, feel free to ask if you have further doubts.
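For completeness, here is a minimal reproducible sketch of this answer, using the values from the question's table:
library(data.table)
library(zoo)
data1 <- data.table(SubID = c(1,1,1,1,1,2,2,2,2,2,2),
                    Day   = c(1,1,1,2,2,1,1,1,2,2,2),
                    Time  = c(1,2,3,1,3,1,2,3,3,2,1),
                    X     = c(NA,NA,7.4,NA,6.2,NA,NA,7.1,5.9,NA,NA))
data1[order(-Time), Xf := na.locf(X), by = .(SubID, Day)]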
You can sort the data by descending time and then use X[1].
library(dplyr)
df <- tibble(SubID = 1, Day = 1, Time = c(1, 2, 3), X = c(NA, NA, 2.2))
df <- df %>%
  group_by(SubID, Day) %>%
  arrange(desc(Time)) %>%
  mutate(
    X = case_when(
      is.na(X) ~ X[1],
      TRUE ~ X
    )
  )
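An equivalent that may read more naturally is tidyr::fill(), which carries the last non-NA value forward within each group (a sketch; assumes the tidyr package):
library(dplyr)
library(tidyr)
df %>%
  group_by(SubID, Day) %>%
  arrange(desc(Time), .by_group = TRUE) %>%
  fill(X, .direction = "down") %>%
  ungroup()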

Data Frame: how to add a column to count previous occurrences in the last 20 weeks?

This is my first question here in a long time :).
I've got a data frame with data about patient visits to a clinic.
visit_id <- c(1,2,3,4,5,6,7,8,9,10)
patient_id <- c(1,2,1,1,3,2,1,4,5,6)
visit_date <- as.Date(c('2016-12-02','2016-12-02','2016-12-30',
'2016-12-15','2016-12-30','2017-02-01',
'2017-02-15','2017-02-10','2017-01-15','2017-03-01'))
df <- data.frame(visit_id, patient_id, visit_date)
It looks like this:
   visit_id patient_id visit_date
1         1          1 2016-12-02
2         2          2 2016-12-02
3         3          1 2016-12-30
4         4          1 2016-12-15
5         5          3 2016-12-30
6         6          2 2017-02-01
7         7          1 2017-02-15
8         8          4 2017-02-10
9         9          5 2017-01-15
10       10          6 2017-03-01
I want to add one more column that would indicate the number of times the patient has been to the clinic in the last 20 weeks:
   visit_id patient_id visit_date visits_previous_20_weeks
1         1          1 2016-12-02                        0
2         2          2 2016-12-02                        0
3         3          1 2016-12-30                        2
4         4          1 2016-12-15                        1
5         5          3 2016-12-30                        0
6         6          2 2017-02-01                        1
7         7          1 2017-02-15                        3
8         8          4 2017-02-10                        0
9         9          5 2017-01-15                        0
10       10          6 2017-03-01                        0
The only data source is this table. In the beginning of the table, since this is the first record, patient 1 has been to the clinic 0 times. But on December 15th, 2016, the patient comes back to the clinic, so the number of visits in the previous 20 weeks (as of that date) is 1.
One inefficient way to do this would be a loop that, for each row in the data frame, goes through the whole data frame and tallies the number of visits by the same patient in the previous 20 weeks. Any better way to do this in R?
Thanks :)
Here's a way using the data.table package. The idea is to first create a 20-week boundary column and then perform a non-equi self-join while counting the matches.
library(data.table)
setDT(df)[, visit_date := as.IDate(visit_date)] # Convert visit_date to a proper Date class
df[, visit_date20 := visit_date - 20*7] # Create the 20-week boundary
## Count previous visits within the range
df[df, .(Visits = .N),
on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
by = .EACHI]
# patient_id visit_date visit_date Visits
# 1: 1 2016-12-02 2016-07-15 0
# 2: 2 2016-12-02 2016-07-15 0
# 3: 1 2016-12-30 2016-08-12 2
# 4: 1 2016-12-15 2016-07-28 1
# 5: 3 2016-12-30 2016-08-12 0
# 6: 2 2017-02-01 2016-09-14 1
# 7: 1 2017-02-15 2016-09-28 3
# 8: 4 2017-02-10 2016-09-23 0
# 9: 5 2017-01-15 2016-08-28 0
# 10: 6 2017-03-01 2016-10-12 0
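To attach the count back to df as a regular column, you can rely on by = .EACHI preserving the row order of the inner df (a sketch following the same join as above):
df[, visits_previous_20_weeks := df[df, .N,
     on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
     by = .EACHI]$N]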
If I understood you well, here is a solution using the data.table package. I found two options (the first one has better performance).
Convert the original data frame into a data.table object:
dt <- data.table(df) # Create a data table from the data frame
setorder(dt, patient_id, visit_date) # Sort by patient_id, then by visit_date
Define the week threshold parameter:
weekNum = 20L # Considering a threshold of: 20-weeks.
OPTION 1: Computing the number of weeks directly from the visit_date column
We define the following function that makes the calculation for each group:
visitFreq <- function(x) {
  n <- length(x)
  result <- numeric(n)
  if (n > 1) {
    for (i in 1:n) {
      # for each row of the column, by patient_id
      ref <- x[i]              # reference date
      x.prev <- x[x < ref]     # select previous dates
      if (length(x.prev) > 0) {
        x.prev <- sapply(x.prev, function(y) {
          ifelse(difftime(ref, y, units = "weeks") <= weekNum, 1, 0)
        })
        result[i] <- sum(x.prev)
      }
    }
  }
  return(result)
}
For each x[i] it finds the previous visits and then computes whether each previous date falls within the defined threshold. What is left is simply to count the number of previous visits within the threshold.
Once we know how to make the calculation, we just need to apply this function to the visit_date column for each patient_id:
dt[, visits := visitFreq(visit_date), by = patient_id]
Note: the function visitFreq has to be written as a vectorized function: it receives an array of visit_date values and returns an array of the same dimension.
OPTION 2: Creating an artificial variable that collects all visit_date values for a given patient.
Now we need to create a function that makes the calculation for computing the number of weeks:
calc <- function(vec, x) {
  vec.prev <- vec[vec < x]   # select all dates before x
  n <- 0
  if (length(vec.prev) > 0) {
    vec.prev <- sapply(vec.prev, function(y) {
      ifelse(difftime(x, y, units = "weeks") <= weekNum, 1, 0)
    })
    n <- sum(vec.prev)
  }
  return(n)
}
where:
vec: an array of dates
x: the reference date
We filter only the dates prior to date x. Then we apply sapply to each element of vec, computing the time difference between y (each element of vec) and the reference date x, using weeks as the unit. The result is 1 for any difference of at most weekNum weeks and 0 otherwise. The number of previous visits within the window is then just the sum of the 1s.
Now we use this function in a data.table object like this:
dt[, visits := .(list(visit_date)), by = patient_id][
   , visits := mapply(calc, visits, visit_date)][order(patient_id)][]
Let's explain it a little bit:
We create a visits variable that is a list of all dates for a given patient_id (because of the by clause).
If we execute the first expression it will produce something like this:
> dt[, visits := .(list(visit_date)), by = patient_id][]
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 2016-12-02,2016-12-15,2016-12-30,2017-02-15
2: 4 1 2016-12-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
3: 3 1 2016-12-30 2016-12-02,2016-12-15,2016-12-30,2017-02-15
4: 7 1 2017-02-15 2016-12-02,2016-12-15,2016-12-30,2017-02-15
5: 2 2 2016-12-02 2016-12-02,2017-02-01
6: 6 2 2017-02-01 2016-12-02,2017-02-01
7: 5 3 2016-12-30 2016-12-30
8: 8 4 2017-02-10 2017-02-10
9: 9 5 2017-01-15 2017-01-15
10: 10 6 2017-03-01 2017-03-01
>
The second statement (the second []-block) just does the calculation, re-assigning the previously created visits variable, but now counting the number of previous visits with respect to the reference date. We need the mapply function for the vectorized computation: on each invocation of the calc function, the input arguments are dt$visits[[i]] (a list element) and the corresponding dt$visit_date[i]. mapply just iterates over all i-elements, invoking calc.
RESULT
Finally, the result will be:
> dt
visit_id patient_id visit_date visits
1: 1 1 2016-12-02 0
2: 4 1 2016-12-15 1
3: 3 1 2016-12-30 2
4: 7 1 2017-02-15 3
5: 2 2 2016-12-02 0
6: 6 2 2017-02-01 1
7: 5 3 2016-12-30 0
8: 8 4 2017-02-10 0
9: 9 5 2017-01-15 0
10: 10 6 2017-03-01 0
>
and I guess this is what you wanted.
Note: there is probably a way to do the calculation on the fly, but I was not able to see how. Perhaps other folks can suggest a slightly more syntactically succinct way.
PERFORMANCE
I was wondering which option has better performance (I expected OP1 to win); let's check:
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
times=100L)
print(op)
It produces the following output:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 3.467451 3.552916 4.165517 3.642150 4.200413 7.96348 100 a
OP2 4.732729 4.832695 5.799648 5.063985 6.073467 13.17264 100 b
>
Therefore the first option has the best performance.
EDIT (added the solution proposed by @DavidArenburg)
Let's include the join solution as a third option, but increase the size of the input by repeating the input vectors, for example:
nSample <- 100
patient_id <- rep(c(1, 2, 1, 1, 3, 2, 1, 4, 5, 6), nSample)
visit_id <- 1:(10 * nSample) # each repetition adds 10 rows
visit_date <- rep(as.Date(c('2016-12-02', '2016-12-02', '2016-12-30',
                            '2016-12-15', '2016-12-30', '2017-02-01',
                            '2017-02-15', '2017-02-10', '2017-01-15', '2017-03-01')), nSample)
df <- data.frame(visit_id, patient_id, visit_date)
opc3 <- function(df) {
df[, visit_date20 := visit_date - 20 * 7] # Create the 20-week boundary
## Count previous visits within the range
df[df, .(visits = .N),
on = .(patient_id, visit_date < visit_date, visit_date > visit_date20),
by = .EACHI]
}
dt <- data.table(df)
dt3 <- copy(dt)[, visit_date := as.IDate(visit_date)] # Convert visit_date to a proper Date class
library(microbenchmark)
op <- microbenchmark(
OP1 = copy(dt)[, visits := visitFreq(visit_date), by = patient_id],
OP2 = copy(dt)[, visits := .(list(visit_date)), by = patient_id][, visits := mapply(calc, visits, visit_date)],
OP3 = opc3(copy(dt3)),
times = 10L)
print(op)
I get the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
OP1 6315.73724 6485.111937 10744.808669 11789.230998 15062.957734 15691.445961 10 b
OP2 6266.80130 6431.330087 11074.441187 11773.459887 13928.861934 15335.733525 10 b
OP3 2.38427 2.845334 5.157246 5.383949 6.711482 8.596792 10 a
>
The @DavidArenburg solution scales much better as the number of rows increases.
How about this solution, using dplyr and lubridate?
library(dplyr)
library(lubridate)
no_of_weeks <- 4 # You want 20 here, but the result will be NULL for the example dataset you've given
df %>%
  mutate(week_filter = visit_date < Sys.Date() - weeks(no_of_weeks)) %>%
  group_by(patient_id) %>%
  mutate(visits_previous_n_weeks = cumsum(week_filter)) %>%
  ungroup()
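Note that Sys.Date() anchors the window to today rather than to each row's visit, so this counts something slightly different from what was asked. A per-row variant relative to each visit date (a sketch; the strict > mirrors the non-equi join answer above):
library(dplyr)
df %>%
  group_by(patient_id) %>%
  mutate(visits_previous_20_weeks = sapply(seq_along(visit_date), function(i)
    sum(visit_date < visit_date[i] & visit_date > visit_date[i] - 20*7))) %>%
  ungroup()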

Counting across subsets in dataframes the R-way

In R I have a dataframe df of this form:
a b      id
1 2 1234758
1 1 1234758
3 5 1234759
5 5 1234759
5 5 1234759
2 2 1234760
I want to count how many times I observe a change between df$a and df$b for each id value.
Since I am interested in the counts per id as well, I did
summary <- as.data.frame(table(df$id))
       id n_id
1 1234758    2
2 1234759    3
3 1234760    1
then I wrote this to compute, for each id, the number of rows where a and b differ (both df and summary are sorted by id):
summary$jumps <- 0
k <- 1
for (i in 1:nrow(summary)) {
  n <- summary$n_id[i]
  for (j in k:(k + n - 1)) if (df$a[j] != df$b[j]) summary$jumps[i] <- summary$jumps[i] + 1
  k <- k + n
}
which walks through df one id at a time, checks a against b, and updates the appropriate entry in summary. This works as intended, but it looks slow to me.
I am fairly new to R; what would be a more efficient way of doing this? How would you vectorize this code? The dataframe has about 30 million such lines.
If I understand your problem correctly, you could use data.table to solve it in a fairly straightforward way:
library(data.table)
dt <- setDT(df)                          # Create a data.table
dt[, list(jumps = sum(a != b)), by = id] # Count rows where a != b, grouped by id
# id jumps
# 1: 1234758 1
# 2: 1234759 1
# 3: 1234760 0
We can use aggregate from base R
aggregate(cbind(jumps = a != b) ~ id, df, FUN = sum)
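For completeness, the same aggregation with dplyr (a sketch):
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(jumps = sum(a != b))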

R: Extract data based on date, "if date lesser than"

I have a dataset with two values for each date like this:
        date x y
1 2013-05-01 1 2
2 2013-05-02 2 2
3 2013-05-03 3 2
date is of class Date (created with as.Date, using the lubridate package).
Now I want the mean of the two values, except for a certain time span, in which I want to use the values of x.
I tried the following:
mean = (x + y)/2
newdata = ifelse((data$date < 2013-10-01 | data$date > 2014-04-09), mean, x)
but it just takes the mean for all dates.
Is it possible to use greater-than/less-than comparisons for dates?
Any suggestions on how to make this work?
Thanks in advance
It looks like you are not casting the comparison values as dates (unquoted, 2013-10-01 is just the arithmetic expression 2013 minus 10 minus 1). Also, the dates you used for comparison don't exclude any of the dates in the dataframe you provided, so I'd expect the mean to be selected every time.
date <- as.Date(c('2013-05-01', '2013-05-02', '2013-05-03'))
x <- c(1, 2, 3)
y <- c(2, 2, 2)
mean <- (x + y)/2
df <- data.frame(date = date, x = x, y = y)
newdata <- ifelse((df$date < as.Date('2013-05-02') | df$date > as.Date('2014-04-09')), mean, x)
newdata
I changed the dates in the condition to be more selective and I got 1.5 2.0 3.0. It selects the first value from mean and the others from x, which agrees with the condition I used in the ifelse().
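The same fix, vectorized with dplyr's if_else (a sketch on the toy data above; the condition is unchanged):
library(dplyr)
df <- df %>%
  mutate(newdata = if_else(date < as.Date('2013-05-02') | date > as.Date('2014-04-09'),
                           (x + y)/2, x))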
How about something like this:
library(lubridate)
library(data.table)
##
set.seed(123)
Data <- data.frame(
date=as.Date(ymd(20130904))+0:364,
x=as.numeric(sample(1:3,365,replace=TRUE)),
y=as.numeric(sample(1:3,365,replace=TRUE)))
setDT(Data)
##
xSpan <- seq.Date(
from=as.Date("2013-10-01"),
to=as.Date("2014-04-09"),
by="day")
##
Edited - forgot to group by date
Data[, z := ifelse(
         date %in% xSpan,
         x,
         mean(c(x, y))),
     by = date]
##
> head(Data)
date x y z
1: 2013-09-04 1 3 2.0
2: 2013-09-05 3 1 2.0
3: 2013-09-06 2 1 1.5
4: 2013-09-07 3 2 2.5
5: 2013-09-08 3 2 2.5
6: 2013-09-09 1 2 1.5
> head(subset(Data, date %in% xSpan))
date x y z
1: 2013-10-01 2 3 2
2: 2013-10-02 1 3 1
3: 2013-10-03 1 1 1
4: 2013-10-04 3 1 3
5: 2013-10-05 3 1 3
6: 2013-10-06 3 1 3
I just defined xSpan as a contiguous sequence of days for which x is used directly (in your example, just the identity function of x). Dates not included in this time span use the mean of x and y to determine their value of z.

R for each row calculate a sum taking values of one of the columns from the rows BEFORE that satisfy special condition

I am using R. I have a big dataset of more than 400K lines. Here is code to reproduce 5 lines of a similar data frame:
Date = as.Date(c("2013-01-03", "2013-01-03", "2013-01-04", "2013-01-04", "2013-01-05"))
CustomerID = as.factor(c("A", "B", "A", "C", "A"))
PurchaseNS = c(13, 14, 12, 8, 10)
df = data.frame(Date, CustomerID, PurchaseNS)
> df
        Date CustomerID PurchaseNS
1 2013-01-03          A         13
2 2013-01-03          B         14
3 2013-01-04          A         12
4 2013-01-04          C          8
5 2013-01-05          A         10
What I need is to add a column that, for each row, equals the sum of the previous purchases by that customer. So in the end I am trying to get this data frame:
> df
        Date CustomerID PurchaseNS previousPurchases
1 2013-01-03          A         13                 0
2 2013-01-03          B         14                 0
3 2013-01-04          A         12                13
4 2013-01-04          C          8                 0
5 2013-01-05          A         10                25
I can achieve that with a for loop, but it takes too much time and I know loops are not recommended in R.
for (i in 1:nrow(df)) {
  df[i, 4] <- sum(subset(df, df$CustomerID == df$CustomerID[i] & df$Date < df$Date[i])$PurchaseNS)
}
I also tried using sapply, but the code ends up looking similar to the one above and also takes too much time:
df$previousPurchases <- sapply(1:nrow(df), function(i)
  sum(subset(df, df$CustomerID == df$CustomerID[i] & df$Date < df$Date[i])$PurchaseNS))
I guess functions like by, with, cumsum, or apply could be useful, but so far I have not been able to apply them.
Thanks in advance for your suggestions!
You can try:
df$prevPurch <- ave(
df$PurchaseNS, df$CustomerID,
FUN=function(x) cumsum(c(0, head(x, -1)))
)
which produces:
# Date CustomerID PurchaseNS prevPurch
# 1 2013-01-03 A 13 0
# 2 2013-01-03 B 14 0
# 3 2013-01-04 A 12 13
# 4 2013-01-04 C 8 0
# 5 2013-01-05 A 10 25
ave breaks up a vector by the groups in another vector, and then applies a function to each group.
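The same trick translates directly to dplyr; like ave above, it relies on each customer's rows already being in date order (a sketch):
library(dplyr)
df %>%
  group_by(CustomerID) %>%
  mutate(prevPurch = cumsum(PurchaseNS) - PurchaseNS) %>% # cumulative sum minus current row = strictly previous purchases
  ungroup()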
