Mean value between dates based on dates in another dataset using R - r

I have two data frames "Conc" and "Flow".
Flow has a value for every day of a set period, whereas Conc only has a value on certain days within that period.
What I want to do is calculate the mean Flow value for each period between consecutive Conc dates using R.
The following code will generate two example data frames to illustrate the kind of datasets I am working with:
Conc <- data.frame(Date = as.Date(c("2012/01/13", "2012/02/16", "2012/05/02", "2012/07/28",
"2012/11/10")), Conc = c(0.88, 0.55, 0.34, 0.21, 0.98))
Flow <- data.frame(Date = c(seq(as.Date("2012/01/01"), by = "day", length.out = 365)),
Flow = c(sample(seq(from = 0.01, to = 5, by = 0.1), size = 365, replace = TRUE)))
The output data frame would ideally be something like:
Period Mean_Flow
1 2.01
2 1.41
3 3.81
4 0.31
I appreciate that the variable time between Conc days makes this tricky. At present the best I have been able to come up with is to do this manually in Excel, but I would really like to find an R solution to save myself having to do this for about 10 different datasets that I have.
Thank you.

Here's a possible approach using the foverlaps function from the data.table package:
Create time intervals in both data sets
library(data.table)
Conc <- setDT(Conc)[, `:=`(start = Date, end = c(Date[2:(.N - 1)] - 1, Date[.N], NA))][-.N]
Flow <- setDT(Flow)[, `:=`(start = Date, end = Date)]
Key the Flow data set in order to use foverlaps function and run the function
setkey(Flow, start, end)
overlaps <- foverlaps(Conc, Flow, type = "any", which = TRUE)
Create indexes of the overlaps within the Flow data set and compute the mean by those indexes
Flow[overlaps$yid, Period := overlaps$xid]
na.omit(Flow[, list(Mean_Flow = mean(Flow)), by = Period])
# Period Mean_Flow
# 1: 1 2.189412
# 2: 2 2.263947
# 3: 3 2.762874
# 4: 4 2.349048

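The following uses a loop over the dates in Conc$Date.
For convenience, Conc$Date is stored in vector A. The variable p holds the indices of the Flow rows that fall strictly between two consecutive Conc dates. The loop runs only up to the second-to-last date, because the last date has no following date to close a period (looping over all of A would print a final NaN).
A <- Conc$Date
for (i in seq_len(length(A) - 1)) {
  # rows of Flow strictly between Conc date i and Conc date i + 1
  p <- which(Flow$Date > A[i] & Flow$Date < A[i + 1])
  M <- mean(Flow$Flow[p])
  print(M)
}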
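As a further sketch (not taken from the answers above), the same grouping can be done in base R with findInterval, which labels every Flow date with the index of the most recent Conc date. It assumes each period runs from one Conc date up to the day before the next one, and that Flow values before the first Conc date and from the last Conc date onward are dropped. The runif values are only a stand-in for the example data.

```r
set.seed(42)
Conc <- data.frame(
  Date = as.Date(c("2012-01-13", "2012-02-16", "2012-05-02",
                   "2012-07-28", "2012-11-10")),
  Conc = c(0.88, 0.55, 0.34, 0.21, 0.98))
Flow <- data.frame(
  Date = seq(as.Date("2012-01-01"), by = "day", length.out = 365),
  Flow = runif(365, min = 0, max = 5))  # stand-in values (assumption)

# Index of the latest Conc date that is <= each Flow date
# (0 means the Flow date lies before the first Conc date)
period <- findInterval(as.numeric(Flow$Date), as.numeric(Conc$Date))

# Keep only Flow rows that fall between two Conc dates
keep <- period >= 1 & period < nrow(Conc)

# Mean Flow per period
mean_flow <- tapply(Flow$Flow[keep], period[keep], mean)
res <- data.frame(Period = as.integer(names(mean_flow)),
                  Mean_Flow = as.numeric(mean_flow))
res
```

Because the period index comes straight from the dates, the same lines should work unchanged for each of the other datasets, as long as both Date columns are sorted.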

Related

adjust "width" argument in rollapply() function in r for discontinuous dates

I have a dataset of daily remotely sensed data. In short, it's reflectance (values between 0 and 1) for the last 20 years. Because it's remotely sensed data, some dates do not have a value because of clouds or some other obstruction.
I want to use rollapply() in R's zoo package to detect in the time series when the values remain at 1.0 for a certain amount of time (let's say 2 weeks) or at 0 for that same amount of time.
I have code to do this, but the width argument in the rollapply() function (the 2-week threshold mentioned in the previous paragraph) looks at data points rather than time. So it looks at 14 data values rather than 14 days, which may span over a month due to the missing data values from cloud cover etc.
Here's an example:
library(dplyr)      # %>%, mutate, filter
library(lubridate)  # ymd
library(zoo)        # rollapply
test_data <- data.frame(date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
                        value = c(0, 1, 1, 1, 0))
test_data$date <- ymd(test_data$date)
select_first_1_value <- test_data %>%
  mutate(value = rollapply(value, width = 3, min, align = "left", fill = NA, na.rm = TRUE)) %>%
  filter(value == 1) %>%
  filter(row_number() == 1) %>%
  ungroup()
With the argument as width = 3, it works. It finds that 2000-01-02 is the first date where a value = 1 occurs for at least 3 values. However, if I change this to 14, it no longer works, because it only sees 5 values in this instance. Even if I wrote out an additional 10 values that equal 1 (for a total of 15), it would be incorrect because the value = 0 at 2000-01-18 and it is only counting data points and not dates.
But when we look at the dates, there are missing dates between 2000-01-03 and 2000-01-17. If both are a value = 1, then I want to extract 2000-01-02 as the first instance where the time series remains at 1 for at least 14 consecutive days. Here, I'm assuming that the values are 1 for the missing days.
Any help is greatly appreciated. Thank you.
There really are two problems here:
(1) How to roll by date rather than by number of points.
(2) How to find the first stretch of 14 days of 1's, assuming that missing dates are 1.
Note that (2) is not readily solved by (1) because the start of the first series of ones may not be any of the listed dates! For example, suppose we change the first date to Dec 1, 1999 giving test_data2 below. Then the start of the first period of 14 ones is Dec 2, 1999. It is not any of the dates in the test_data2 series.
test_data2 <- data.frame(
date = c("1999-12-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
value = c(0, 1, 1, 1, 0))
1) What we need to do is not roll by date but rather expand the series to fill in the missing dates, giving zz, and then use rollapply. Below we do that by creating a zoo series (which also converts the dates to Date class) and then converting that to ts class. Because ts class can only represent regularly spaced series, that conversion will fill in the missing dates and provide a value of NA for them. We can fill those in with 1 and then convert back to zoo with a Date class index.
library(zoo)
z <- read.zoo(test_data2)
zz <- z |> as.ts() |> na.fill(1) |> as.zoo() |> aggregate(as.Date)
r <- rollapply(zz, 14, min, na.rm = TRUE, partial = TRUE, align = "left")
time(r)[which(r == 1)[1]]
## [1] "1999-12-02"
2) Another way to solve this not involving rollapply at all would be to use rle. Using zz from above
ok <- with(rle(coredata(zz)), rep(lengths >= 14 & values == 1, lengths))
time(zz)[which(ok)[1]]
## [1] "1999-12-02"
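For readers without zoo, the fill-then-rle idea from (1) and (2) can also be sketched in base R. This is only an illustration under the same assumption that missing days count as 1; the names all_days, filled and first_date are made up for the example.

```r
test_data2 <- data.frame(
  date = as.Date(c("1999-12-01", "2000-01-02", "2000-01-03",
                   "2000-01-17", "2000-01-18")),
  value = c(0, 1, 1, 1, 0))

# One row per calendar day between the first and last observation
all_days <- data.frame(date = seq(min(test_data2$date),
                                  max(test_data2$date), by = "day"))
filled <- merge(all_days, test_data2, all.x = TRUE)

# Assumption from the question: days without an observation are treated as 1
filled$value[is.na(filled$value)] <- 1

# Run-length encode the daily values and locate the first run of 14+ ones
r <- rle(filled$value)
run_start <- cumsum(r$lengths) - r$lengths + 1
hit <- which(r$values == 1 & r$lengths >= 14)[1]
first_date <- filled$date[run_start[hit]]
first_date
```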
3) Another way without using rollapply is to extract the 0 value rows and then keep only those that are more than 14 days before the next 0 value row. Finally take the first such row and use the date one day after it. This assumes that there is at least one 0 row before the first run of 14+ ones. Below we have returned to using test_data from the question, although this would also have worked with test_data2.
library(dplyr)
test_data %>%
mutate(date = as.Date(date)) %>%
filter(value == 0) %>%
mutate(diff = as.numeric(lead(date) - date)) %>%
filter(diff > 14) %>%
head(1) %>%
mutate(date = date + 1)
## date value diff
## 1 2000-01-02 0 17
rollapply over dates rather than points
4) The question also discussed using rollapply over dates rather than points which we address here. As noted above this does not actually solve the question of finding the first stretch of 14+ ones so instead we show how to find the first date in the series which starts a stretch of at least 14 ones. In general, we do this by first calculating a width vector using findInterval and then use rollapply in the usual way but with those widths rather than using a scalar width. This only involves one extra line of code to calculate the widths, w.
# using test_data from question
tt <- as.Date(test_data$date)
w <- findInterval(tt + 13, tt, rightmost.closed = TRUE) - seq_along(tt) + 1
r <- rollapply(test_data$value, w, min, fill = NA, na.rm = TRUE, align = "left")
tt[which(r == 1)[1]]
## [1] "2000-01-02"
There are further examples in ?rollapply showing how to roll by time rather than number of points.
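To make the date-based window concrete, here is a small base R sketch that computes the same left-aligned 14-day minimum directly from the dates, without a width vector. It is meant as a cross-check of the idea rather than a replacement for rollapply, and win_min is a name chosen for this example.

```r
tt <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-03",
                "2000-01-17", "2000-01-18"))
value <- c(0, 1, 1, 1, 0)

# For each observed date, take the minimum of all values whose date
# falls in the 14-day window [date, date + 13]
win_min <- sapply(seq_along(tt), function(i) {
  in_window <- tt >= tt[i] & tt <= tt[i] + 13
  min(value[in_window])
})

# First observed date whose 14-day window contains only ones
first_date <- tt[which(win_min == 1)[1]]
first_date
```

This loops over every date, so it scales quadratically; for 20 years of daily data that is still small, but the findInterval widths above avoid the repeated comparisons.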
sqldf
5) A completely different way of finding the first date in the series that starts a stretch of 14+ ones is to use an SQL self join. It joins test, aliased as a, to a second instance of test, aliased as b, matching each row of a with all rows of b whose date lies within the 14-day window starting at a's date. Grouping by the rows of a, it takes the minimum of the matched b values and stores it in a new column min14. The having clause then keeps only those rows for which min14 is 1, and of those the limit clause keeps the first. We then extract the date at the end.
library(sqldf)
test <- transform(test_data, date = as.Date(date))
sqldf("select a.*, min(b.value) min14
from test a
left join test b on b.date between a.date and a.date + 13
group by a.rowid
having min14 = 1
limit 1")$date
## [1] "2000-01-02"
You may also look into the runner package, where you can pass the window size k in days, weeks etc. This example sums value over the last 3 days:
library(dplyr)
library(runner)
test_data %>%
mutate(date = as.Date(date),
sum_val = runner(value, k = "3 days", idx = date, f = sum))
# date value sum_val
#1 2000-01-01 0 0
#2 2000-01-02 1 1
#3 2000-01-03 1 2
#4 2000-01-17 1 1
#5 2000-01-18 0 1
Notice row 4 has value 1 (and not 3) because there is only 1 value that occurred in last 3 days.

Converting dataframe with multiple values for one date into a ts object in R

I have a large dataset with multiple values for specific days. There are missing values in the dataset as it's for a long period of time. Here's a small example:
set.seed(1)
data <- data.frame(
Date = sample(c("1993-07-09", "1993-07-09", "1993-07-10", "1993-08-11", "1993-08-11", "1993-08-11")),
Oxygen = sample(c(0.2, 0.4, 0.4, 0.2, 0.4, 0.5))
)
data$Date <- as.Date(data$Date)
I want to convert this dataframe into a ts object, so that I can forecast, use arima models, and eventually find outliers.
It specifically needs to be a ts object and not a xts object.
The problem I'm facing is:
1) I don't know how to convert a data frame into a ts object.
2) Create a ts object that allows for multiple values to take place for a single day.
Any help would be greatly appreciated. Thank you!
(1) mts. ts objects must be regularly spaced (i.e. the same amount of time between successive points) and cannot represent dates directly (though numbers can stand in for them). We therefore assume that the August dates were meant to be July, so that the dates are consecutive, and we use the number of days since the Epoch (January 1, 1970) as the time.
Add a sequence number to distinguish equal dates and split the series into multiple columns:
library(zoo)
data3 <- transform(data2, seq = ave(1:nrow(data2), Date, FUN = seq_along))
z <- read.zoo(data3, index = "Date", split = "seq")
as.ts(z)
giving:
Time Series:
Start = 8590
End = 8592
Frequency = 1
1 2 3
8590 0.5 0.4 NA
8591 0.4 NA NA
8592 0.2 0.2 0.4
(2) mean. Alternatively, average the values on equal dates:
z2 <- read.zoo(data2, index = "Date", aggregate = mean)
as.ts(z2)
giving:
Time Series:
Start = 8590
End = 8592
Frequency = 1
[1] 0.4500000 0.4000000 0.2666667
(3) Ignore Date. We could ignore the Date column (as the poster suggested), in which case we just use 1, 2, 3, ... as the time index:
ts(data$Oxygen)
(4) 1st point each month. Since, in a comment, the poster indicated that there is a lot of data (20 years), we could take the first point in each month, forming a monthly series.
as.ts(read.zoo(data, index = "Date", FUN = as.yearmon, aggregate = function(x) x[1]))
Note
August dates have been changed to July to form data2 above:
set.seed(1)
data2 <- data.frame(
Date = sample(c("1993-07-09", "1993-07-09", "1993-07-10",
"1993-07-11", "1993-07-11", "1993-07-11")),
Oxygen = sample(c(0.2, 0.4, 0.4, 0.2, 0.4, 0.5))
)
data2$Date <- as.Date(data2$Date)
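The averaging approach in (2) can also be sketched without zoo, using aggregate and ts from base R. The values below are typed in from the printed output in (1) rather than regenerated with sample, and the sketch assumes the dates are consecutive after aggregation, which it checks explicitly.

```r
# Values copied from the output shown in (1) (assumption: same data2)
data2 <- data.frame(
  Date = as.Date(c("1993-07-09", "1993-07-09", "1993-07-10",
                   "1993-07-11", "1993-07-11", "1993-07-11")),
  Oxygen = c(0.5, 0.4, 0.4, 0.2, 0.2, 0.4))

# One mean value per date, sorted by date
daily <- aggregate(Oxygen ~ Date, data = data2, FUN = mean)

# ts needs regular spacing, so stop if any day is missing
stopifnot(all(as.numeric(diff(daily$Date)) == 1))

# Use days since 1970-01-01 as the time index, one observation per day
x <- ts(daily$Oxygen, start = as.numeric(daily$Date[1]), frequency = 1)
x
```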

R How to Split given Time Periods in interval of 30 days in R

I have data with Order Id, Start Date & End Date. I have to split both the Start and End dates into intervals of 30 days, and derive two new variables “split start date” and “split end date”.
Example: The below example illustrates how split dates are created when the Start Date is “01/05/2017” and the End Date is “06/07/2017”
Suppose an order has start and end dates as below
see the image for example
What is the code for this problem in R ?
Here is a solution which should generalize to multiple order id's. I have created a sample data with two order id's. The basic idea is to calculate the number of intervals between start_date and end_date. Then we repeat the row for each order id by the number of intervals, and also create a sequence to determine which interval we are in. This is the purpose of creating functions f and g and the use of Map.
The rest is just vector manipulation, where we define split_start_date and split_end_date. The last statement ensures that split_end_date does not exceed end_date.
df <- data.frame(
order_id = c(1, 2),
start_date = c(as.Date("2017-05-01"), as.Date("2017-08-01")),
end_date = c(as.Date("2017-07-06"), as.Date("2017-09-15"))
)
df$diff_days <- as.integer(df$end_date - df$start_date)
df$num_int <- ceiling(df$diff_days / 30)
f <- function(rowindex) {
rep(rowindex, each = df[rowindex, "num_int"])
}
g <- function(rowindex) {
1:df[rowindex, "num_int"]
}
rowindex_rep <- unlist(Map(f, 1:nrow(df)))
df2 <- df[rowindex_rep, ]
df2$seq <- unlist(Map(g, 1:nrow(df)))
df3 <- df2
df3$split_start_date <- df3$start_date + (df3$seq - 1) * 30
df3$split_end_date <- df3$split_start_date + 29
df3[which(df3$seq == df3$num_int), ]$split_end_date <-
df3[which(df3$seq == df3$num_int), ]$end_date
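A more compact variant of the same idea is sketched below, using seq on Date objects to generate the 30-day start dates and pmin to cap the last interval. The helper name split_order is made up for this example. One difference worth noting: when the span is an exact multiple of 30 days, this version produces a final one-day interval instead of extending the last interval to 31 days.

```r
df <- data.frame(
  order_id = c(1, 2),
  start_date = as.Date(c("2017-05-01", "2017-08-01")),
  end_date = as.Date(c("2017-07-06", "2017-09-15")))

# Split one order (row i of df) into consecutive 30-day intervals
split_order <- function(i) {
  s <- seq(df$start_date[i], df$end_date[i], by = 30)  # start every 30 days
  e <- pmin(s + 29, df$end_date[i])                    # cap at the order's end date
  data.frame(order_id = df$order_id[i],
             split_start_date = s,
             split_end_date = e)
}

out <- do.call(rbind, lapply(seq_len(nrow(df)), split_order))
out
```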

splitting in samples and operating on them

I am just beginning with R and I have a beginner's question.
I have the following data frame (simplified):
Time: 00:01:00 00:02:00 00:03:00 00:04:00 ....
Flow: 2 4 5 1 ....
I would like to know the mean flow every two minutes instead of every minute. I need this for many hours of data.
I want to save those new means in a list. How can I do this using an apply function?
I assume you have continuous data without gaps, with values for Flow for every minute.
In base R we can use aggregate:
df.out <- data.frame(Time = df[seq(0, nrow(df) - 1, 2) + 1, "Time"]);
df.out$mean_2min = aggregate(
df$Flow,
by = list(rep(seq(1, nrow(df) / 2), each = 2)),
FUN = mean)[, 2];
df.out;
# Time mean_2min
#1 00:01:00 3
#2 00:03:00 3
Explanation: Extract only the odd rows from df; aggregate values in column Flow by every 2 rows, and store the mean in column mean_2min.
Sample data
df <- data.frame(
Time = c("00:01:00", "00:02:00", "00:03:00", "00:04:00"),
Flow = c(2, 4, 5, 1))
You can create a new variable in your data by rounding your time variable down to the nearest two minutes, then use data.table to calculate the mean for each two-minute group.
To help you precisely, you would need to show how your data is set up. If, for instance, your data is set up like this:
library(data.table)
dt = data.table(Time = c(0:3), Flow = c(2,4,5,1))
Then the following would work for you:
dt[, twomin := floor(Time/2)*2]
dt[, mean(Flow), by = twomin]
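Since the question asked for an apply-style function and a list, here is a minimal base R sketch with tapply. It assumes one row per minute with no gaps, as in the first answer, and groups rows in pairs by position; an odd trailing row simply forms its own group.

```r
df <- data.frame(
  Time = c("00:01:00", "00:02:00", "00:03:00", "00:04:00"),
  Flow = c(2, 4, 5, 1))

# Rows 1-2 get group 0, rows 3-4 get group 1, and so on
grp <- (seq_len(nrow(df)) - 1) %/% 2

# Mean Flow per two-minute group
means <- tapply(df$Flow, grp, mean)

# Store the result as a list, as requested in the question
mean_list <- as.list(means)
mean_list
```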

Optimising subsetting with for loop in R

I'm using R and RStudio to analyse GTFS public transport feeds and to create timetable range plots using ggplot2. The code currently works fine but is quite slow, which is problematic when working with very big CSVs as is often the case here.
The slowest part of the code is as follows (with some context): a for loop that iterates through the data frame and subsets each unique trip into a temporary data frame from which the extreme arrival and departure values (first & last rows) are extracted:
# Creates an empty df to contain trip_id, trip start and trip end times
Trip_Times <- data.frame(Trip_ID = character(), Departure = character(), Arrival = character(), stringsAsFactors = FALSE)
# Creates a vector containing all trips of the analysed day
unique_trips = unique(stop_times$trip_id)
# Iterates through stop_times for each unique trip_id and populates previously created data frame
for (i in seq(from = 1, to = length(unique_trips), by = 1)) {
temp_df <- subset(stop_times, trip_id == unique_trips[i])
Trip_Times[nrow(Trip_Times) + 1, ] <- c(temp_df$trip_id[[1]], temp_df$departure_time[[1]], temp_df$arrival_time[[nrow(temp_df)]])
}
The stop_times df looks as follows with some feeds containing over 2.5 million lines giving around 200k unique trips, hence 200k loop iterations...
head(stop_times)
trip_id arrival_time departure_time stop_sequence
1 011_0840101_A14 7:15:00 7:15:00 1
2 011_0840101_A14 7:16:00 7:16:00 2
3 011_0840101_A14 7:17:00 7:17:00 3
4 011_0840101_A14 7:18:00 7:18:00 4
5 011_0840101_A14 7:19:00 7:19:00 5
6 011_0840101_A14 7:20:00 7:20:00 6
Would anyone be able to advise me how to optimise this code in order to obtain faster results? I don't believe apply can be used here, but I may well be wrong.
This should be straightforward with dplyr...
library(dplyr)
Trip_Times <- stop_times %>%
group_by(trip_id) %>%
summarise(departure_time=first(departure_time),
arrival_time=last(arrival_time))
We can use data.table
library(data.table)
setDT(stop_times)[, .(departure_time = departure_time[1L],
arrival_time = arrival_time[.N]) , by = trip_id]
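If adding a package is not an option, a vectorised base R sketch with duplicated gives the same first and last rows per trip without a loop. It assumes that the rows of each trip are contiguous and already ordered by stop_sequence, as in the head(stop_times) output; the small stop_times below is invented for illustration.

```r
# Toy data in the same layout as stop_times (values are made up)
stop_times <- data.frame(
  trip_id = c("A", "A", "A", "B", "B"),
  arrival_time = c("7:15:00", "7:16:00", "7:17:00", "8:00:00", "8:05:00"),
  departure_time = c("7:15:00", "7:16:00", "7:17:00", "8:00:00", "8:05:00"),
  stop_sequence = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE)

# TRUE on the first row of each trip
first_row <- !duplicated(stop_times$trip_id)
# TRUE on the last row of each trip
last_row <- !duplicated(stop_times$trip_id, fromLast = TRUE)

Trip_Times <- data.frame(
  Trip_ID = stop_times$trip_id[first_row],
  Departure = stop_times$departure_time[first_row],
  Arrival = stop_times$arrival_time[last_row],
  stringsAsFactors = FALSE)
Trip_Times
```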
