Does anyone know if it is possible to use the na.approx function to interpolate over a varying timeframe?
Suppose we have a df like this (the real df has over 5,000 rows):
Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67
I want to conduct a linear interpolation depending on the date range. For example, for the two NAs at the beginning: 1) 14 - 5 = 9; 2) counting the days from 2022-05-02 until 2022-05-06 = 5 days; so we divide: 3) 9/5 = 1.8.
Value for NA at 2022-05-03 is 6.8 and for 2022-05-06 is 8.6.
Second example at 2022-05-09: 1) 67-15 = 52, 2) 2022-05-08 until 2022-05-10 = 3 days, 3) 52/3 = 17.33333. Value for NA at 2022-05-09 is 32.33333 (= 15 + 17.33333)
Is it possible to do this with the na.approx function? If not, how else can I approach this?
Using the data frame DF defined reproducibly in the Note at the end, we see that what is asked for, i.e. linear interpolation with respect to Date, is what you get (after correcting the calculations in the question) if you apply na.approx to a zoo series:
library(zoo)
z <- read.zoo(DF)
na.approx(z)
## 2022-05-01 2022-05-02 2022-05-03 2022-05-06 2022-05-07 2022-05-08 2022-05-09
## 6.0 5.0 6.8 12.2 14.0 15.0 41.0
## 2022-05-10
## 67.0
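The values 6.8, 12.2 and 41 come from weighting by days elapsed; as a quick hand check of the corrected arithmetic:
slope1 <- (14 - 5) / 5   # 5 days between 2022-05-02 and 2022-05-07, i.e. 1.8 per day
5 + 1 * slope1           # 2022-05-03, one day after 2022-05-02: 6.8
5 + 4 * slope1           # 2022-05-06, four days after 2022-05-02: 12.2
slope2 <- (67 - 15) / 2  # 2 days between 2022-05-08 and 2022-05-10, i.e. 26 per day
15 + 1 * slope2          # 2022-05-09: 41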
Alternatively, in terms of the original data frame, we can use the x= argument of na.approx to specify that interpolation is with respect to Date.
DF$Date <- as.Date(DF$Date)
transform(DF, Value = na.approx(Value, Date, na.rm = FALSE))
## Date Value
## 1 2022-05-01 6.0
## 2 2022-05-02 5.0
## 3 2022-05-03 6.8
## 4 2022-05-06 12.2
## 5 2022-05-07 14.0
## 6 2022-05-08 15.0
## 7 2022-05-09 41.0
## 8 2022-05-10 67.0
We suggest reviewing the documentation via ?na.approx in R.
Note
Lines <- "Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67"
DF <- read.csv(text = Lines, strip.white = TRUE)
I can do it with a bit of a slog, but I may need some help with the date diffs. For the first case there's 5 days between 02/05 & 07/05. In the second there's 2 days, not 3, between 08/05 & 10/05. Have I missed something? :)
Code below:
# get data into required shape, and using data.table package
df <- read.table(text="
Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67
", header=T)
library(data.table)
library(zoo)
library(lubridate)
dt <- as.data.table(df)
# read.table split on whitespace, so the header became "Date." and each
# date kept a trailing comma; strip the comma and parse as a Date
dt[, Date := lubridate::ymd(gsub(",", "", `Date.`))]
setorder(dt, Date)
# first step, fill in to get the starting value
dt[, Value2 := zoo::na.locf0(Value)]
# group consecutive rows by NA status (we only really care about the NA runs);
# rleid starts a new group each time is.na(Value) changes
dt[, Group := rleid(is.na(Value))]
# find the value after the NA
dt[, ValueNext := shift(Value2, n=1, type="lead")]
# find the dates before and after the NA period
dt[, DatePre := shift(Date, n=1, type="lag")]
dt[, DateNext := shift(Date, n=1, type="lead")]
# find the differences in the values & dates
dt[, ValueDiff := ValueNext[.N]-Value2[1], by=Group]
dt[, DateDiff := as.integer(DateNext[.N]-DatePre[1]), by=Group]
# divide through to get the addition
dt[, ValueAdd := ValueDiff/DateDiff]
# by group, use cumulative sum to add to the starting value
dt[, ValueOut := Value2+cumsum(ValueAdd), by=Group]
# we only care about NA groups, so revert back to original value for other
# cases
dt[!is.na(Value), ValueOut := Value]
# check the NA rows
# ! only difference is I get 2 as the date diff for 2022-05-09, not 3
dt[is.na(Value),]
# Final output
dt[, .(Date, Value, ValueOut)]
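For reference, the NA rows under this row-stepped scheme should come out as below: 2022-05-06 gets 8.6 (equal increments per row, as the question asked for), where date-weighted na.approx gives 12.2; both approaches agree on 41 for 2022-05-09.
dt[is.na(Value), .(Date, ValueOut)]
##          Date ValueOut
## 1: 2022-05-03      6.8
## 2: 2022-05-06      8.6
## 3: 2022-05-09     41.0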
Related
I am trying to get my existing observations into 10-minute intervals in R.
I did this:
data3$date= ceiling_date(as.POSIXct(data3$betdate), unit = "10 minutes")
data3 %>% group_by(date, prov) %>%
summarise(cant=n())
But the problem with this code is that if there is no observation in an interval, that interval will not appear in the output data, which makes sense because there are no observations with a date in that interval. So I need to capture the information about the intervals that have no observations registered. Any ideas? Thanks in advance to all of you.
See a simplified example of @Limey's comment, using just months and data.table
# set up fake data
set.seed(1000)
library(lubridate)
# create a sequence of months to build the fake data around
months <- seq(ymd("2022-01-01"), ymd("2022-06-01"), by = "month")
# create fake data, and remove some rows
dat <- data.frame(month = months, values = sample(100:200, length(months)))
dat <- dat[-sample(1:length(months),3),]
dat
# month values
#1 2022-01-01 167
#4 2022-04-01 150
#6 2022-06-01 128
Here we perform the merge and see the NAs representing the missing observations:
library(data.table)
setDT(dat)
months_listed <- data.frame(month = seq(min(dat$month), max(dat$month), by = "month"))
setDT(months_listed)
merge.data.table(months_listed, dat, by = "month", all.x = T)
# month values
#1: 2022-01-01 167
#2: 2022-02-01 NA
#3: 2022-03-01 NA
#4: 2022-04-01 150
#5: 2022-05-01 NA
#6: 2022-06-01 128
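Applying the same idea at 10-minute resolution is just a matter of swapping the month sequence for a 10-minute one. A sketch, assuming data3 has columns betdate (POSIXct) and prov as in the question:
library(lubridate)
library(data.table)
# count observations per 10-minute interval and provider
counts <- as.data.table(data3)[, .(cant = .N),
                               by = .(date = ceiling_date(betdate, "10 minutes"), prov)]
# full grid of every interval crossed with every provider
grid <- CJ(date = seq(min(counts$date), max(counts$date), by = "10 min"),
           prov = unique(counts$prov))
# left join: intervals with no observations show cant = NA
merge.data.table(grid, counts, by = c("date", "prov"), all.x = TRUE)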
I need to calculate the rolling 14-day average for a large data set. The data set is private, although I can share a small snippet.
The data set comes from an instrument in the field which does not operate every day. For instance, a snippet of the data frame would look like so:
Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000
I have the following plot.
library(zoo)
library(tidyverse)
ggplot(data=df_days, aes(x=Date, y=Value)) +
geom_line(color="black", lwd=0.5) +
geom_point(lwd=0.5) +
geom_line(y=rollmean(df_days$Value, 14, na.pad=TRUE), color="red", lwd=0.8)
I realised that I'm actually taking the 14-point average, i.e. the average of 14 data points. Is there a way to take the 14-day average, based on the dates themselves?
1) Using the input from the question, shown reproducibly in the Note at the end, we calculate the number of points to use at each date, w, and then use rollapplyr with that.
library(zoo)
within(DF, {
w <- seq_along(Date) - findInterval(Date - 14, Date)
mean14 <- rollapplyr(Value, w, mean)
})
giving the following, where mean14 is the mean and w is the number of points used to calculate it. If there were no missing dates this would give the same result as rollapplyr(DF$Value, 14, mean, partial = TRUE); with missing dates it uses fewer points, based on how many dates fall in the 14-day window. (Note that using different numbers of points for each mean can affect the variance.)
Date Value mean14 w
1 2022-01-28 196.0000 196.00000 1
2 2022-01-31 104.0000 150.00000 2
3 2022-02-01 0.0000 100.00000 3
4 2022-02-02 98.0000 99.50000 4
5 2022-02-03 0.0000 79.60000 5
6 2022-02-07 139.9231 89.65385 6
7 2022-02-08 114.5000 93.20330 7
8 2022-02-09 121.6429 96.75824 8
9 2022-02-10 96.5000 96.72955 9
10 2022-02-11 151.6364 91.80026 9
11 2022-02-14 85.8750 89.78637 9
12 2022-02-15 98.9000 100.77526 9
13 2022-02-18 209.4000 127.29716 8
14 2022-02-21 172.1818 131.32951 8
15 2022-02-22 0.0000 117.01700 8
16 2022-02-23 0.0000 101.81165 8
17 2022-02-28 264.0000 124.08030 6
18 2022-03-01 131.7500 129.55530 6
19 2022-03-03 119.3333 128.09502 7
20 2022-03-04 88.8000 110.86645 7
21 2022-03-07 152.1667 108.00714 7
22 2022-03-08 24.5000 111.50714 7
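The w trick can be sanity-checked on a toy vector of dates: findInterval(Date - 14, Date) counts the dates at or before 14 days ago, so subtracting it from seq_along(Date) leaves the number of dates in the half-open window (Date - 14, Date].
d <- as.Date(c("2022-01-01", "2022-01-05", "2022-01-20"))
seq_along(d) - findInterval(d - 14, d)
## [1] 1 2 1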
2) Another approach is to add the missing dates, fill in Value in those missing dates with NA and then use rollapplyr.
m <- merge(DF, data.frame(Date = seq(min(DF$Date), max(DF$Date), 1)), all = TRUE)
na.omit(transform(m,
mean14 = rollapplyr(Value, 14, mean, na.rm = TRUE, partial = TRUE)))
3) A variation of the above is to use zoo objects. Note that fortify.zoo(zz) can be used to create a data frame from a zoo object.
library(zoo)
z <- read.zoo(DF)
# 1
tt <- time(z)
w <- seq_along(tt) - findInterval(tt - 14, tt)
zz <- rollapplyr(z, w, mean)
# 2
m <- merge(z, zoo(, seq(start(z), end(z), 1)))
zz <- na.omit(rollapply(m, 14, mean, na.rm = TRUE))
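As mentioned above, either zz can be converted back to a data frame:
fortify.zoo(zz)  # a data frame with an Index column plus the series values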
Note
Lines <- "Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000"
DF <- read.csv(text = Lines)
DF$Date <- as.Date(DF$Date)
There may be more elegant solutions, but you can fill in the missing dates with NA:
df$Date <- as.Date(df$Date)
library(dplyr)
library(tidyr)
df %>% complete(Date = seq(min(Date),max(Date),1), fill = list(Value = NA))
Output:
# A tibble: 40 × 2
# Date Value
# <date> <dbl>
# 1 2022-01-28 196
# 2 2022-01-29 NA
# 3 2022-01-30 NA
# 4 2022-01-31 104
# 5 2022-02-01 0
# 6 2022-02-02 98
# ...
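Once the calendar is complete, a fixed 14-row window really is a 14-day window, so (as a sketch building on the above with zoo, not part of the original answer) the rolling mean could follow directly:
library(zoo)
df %>%
  complete(Date = seq(min(Date), max(Date), 1)) %>%
  mutate(mean14 = rollapplyr(Value, 14, mean, na.rm = TRUE, partial = TRUE)) %>%
  filter(!is.na(Value))  # drop the filler rows again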
I am trying to use dplyr in R to calculate rolling stats (mean, sd, etc.) over a dynamic, date-based window for specific models. For instance, within groupings of items, I would like to calculate the rolling mean of all data from the 10 days prior. The dates in the data are not sequential and not complete, so I can't use a fixed window.
One way to do this is to use rollapply, referencing the window width as shown below. However, I'm having trouble calculating the dynamic width. I'd prefer a method that omits the intermediate step of calculating the window and simply calculates based on the date_lookback. Here's a toy example.
I've used for loops to do this, but they are very slow.
library(dplyr)
library(zoo)
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE) %>%
mutate(.,
cut_date = date - date_lookback, #calcs based on sample since this date
dyn_win = c(1,1,2,3,3,1,2,3,4,5), ##!! need to calculate this vector??
roll_mean = rollapply(data, align = "right", width = dyn_win, mean),
roll_sd = rollapply(data, align = "right", width = dyn_win, sd))
These are the roll_mean and roll_sd results I'm looking for:
> df
label date data cut_date dyn_win roll_mean roll_sd
1 a 2017-01-02 790 2016-12-23 1 790.0000 NA
2 a 2017-01-20 493 2017-01-10 1 493.0000 NA
3 a 2017-01-21 718 2017-01-11 2 605.5000 159.0990
4 a 2017-01-30 483 2017-01-20 3 564.6667 132.8847
5 a 2017-01-31 825 2017-01-21 3 675.3333 174.9467
6 b 2017-01-05 186 2016-12-26 1 186.0000 NA
7 b 2017-01-08 599 2016-12-29 2 392.5000 292.0351
8 b 2017-01-09 408 2016-12-30 3 397.6667 206.6938
9 b 2017-01-10 108 2016-12-31 4 325.2500 222.3921
10 b 2017-01-11 666 2017-01-01 5 393.4000 245.5928
Thanks in advance.
You could try explicitly referencing your dataset inside the dplyr call:
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE)
df %>%
group_by(date,label) %>%
mutate(.,
roll_mean = mean(ifelse(df$date >= date-date_lookback & df$date <= date & df$label == label,
df$data,NA),na.rm=TRUE),
roll_sd = sd(ifelse(df$date >= date-date_lookback & df$date <= date & df$label == label,
df$data,NA),na.rm=TRUE))
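If the intermediate dyn_win must be computed rather than hard-coded, a sketch of my own (adapting the findInterval window-width trick from the 14-day rolling-mean answer earlier) should reproduce the hard-coded c(1,1,2,3,3,1,2,3,4,5) for this toy data:
library(dplyr)
library(zoo)
df %>%
  group_by(label) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # count of rows whose date falls in [date - date_lookback, date]
    dyn_win = row_number() - findInterval(date - date_lookback - 1, date),
    roll_mean = rollapply(data, dyn_win, mean, align = "right"),
    roll_sd = rollapply(data, dyn_win, sd, align = "right")
  ) %>%
  ungroup()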
I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are included in ns, and vice versa.
For each of the values in data$id I am trying to find whether there is an ns$date within 14 days before the data$date where data$id == ns$id, and to report the difference in days.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE where ns$date[ns$id == data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where received is TRUE. I hope this makes sense now.
This is where I got so far
# convert dates (ymd comes from lubridate)
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
#NOTE THAT this will automatically rename data$date as date.x and tmp$date as date.y
# create variable to say when there is a date difference less than 14 days
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
This link, Find if date is within 14 days if id matches between datasets in R, provides an idea, but the result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now. If not, please let me know what needs further clarification.
M
PS: as requested, I added what the desired output should look like (same number of rows as data = 10; no rows for ids in ns that are not in data). Should have thought earlier that this might help.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class (in recent data.table versions the joined copy of data's date may be named i.date rather than date.1; adjust accordingly)
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
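For completeness, a sketch of my own (not part of the answer above) that finishes the asker's merge-based attempt with dplyr, collapsing tmp down to one row per data row; note the window test uses >= 0 rather than the > 0 in the attempt, so that the same-day match for id 87 counts:
library(dplyr)
tmp %>%
  mutate(gap = as.numeric(difftime(date.x, date.y, units = "days")),
         hit = !is.na(gap) & gap >= 0 & gap < 14) %>%
  group_by(id, date = date.x) %>%
  summarise(received = if (all(is.na(date.y))) NA else any(hit),
            dif = if (any(hit)) min(gap[hit]) else NA_real_,
            .groups = "drop")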
I'm trying to figure out the fastest way to aggregate a large data frame (about 50M rows) that looks similar to:
>sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
+ "date" = sample(seq(as.Date("2014-01-01"),as.Date("2014-02-13"),by=1),6),
+ "value" = runif(6))
> sample_frame
id date value
1 73 2014-02-11 0.84197491
2 7 2014-01-14 0.08057893
3 73 2014-01-16 0.78521616
4 7 2014-01-24 0.61889286
5 73 2014-02-06 0.54792356
6 7 2014-01-06 0.66484848
Here we have 2 unique IDs with 3 dates and a value assigned to each. I know that I can use ddply, or data.table, or just lapply to aggregate and find the mean for each ID.
What I'm really looking for is a way to quickly find the mean for each ID for the most recent two dates. For example, with sapply:
> sapply(split(sample_frame,sample_frame$id),function(x){
+ mean(x$value[x$date%in%x$date[order(x$date,decreasing=T)][1:2]])
+ })
7 73
0.3497359 0.6949492
I can't figure out how to get data.table to do this. Thoughts? Hints?
Why not use tail in your "data.table" aggregation step?
set.seed(1)
sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
"date" = sample(seq(as.Date("2014-01-01"),
as.Date("2014-02-13"),by=1),6),
"value" = runif(6))
DT <- data.table(sample_frame, key = "id,date")
DT
# id date value
# 1: 27 2014-01-09 0.20597457
# 2: 27 2014-01-26 0.62911404
# 3: 27 2014-02-07 0.68702285
# 4: 37 2014-02-06 0.17655675
# 5: 37 2014-02-09 0.06178627
# 6: 37 2014-02-13 0.38410372
DT[, mean(tail(value, 2)), by = id]
# id V1
# 1: 27 0.6580684
# 2: 37 0.2229450
Since you require the mean of just two values, you can compute it directly (without calling mean), and you can use the special variable .N instead of tail for a further speed-up. You just have to take care of the case where there's only one date, which max(1L, .N-1) below handles. This should be much faster.
DT[, (value[.N]+value[max(1L, .N-1)])/2, by=id]
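A quick check (a sketch run against the DT built above) that the direct formula agrees with the tail version:
all.equal(DT[, mean(tail(value, 2)), by = id]$V1,
          DT[, (value[.N] + value[max(1L, .N - 1)])/2, by = id]$V1)
## [1] TRUE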