Calculate running difference of time using difftime on one column of timestamps - r

How would you calculate the time difference between two consecutive rows of timestamps in minutes and add the result to a new column?
I have tried this:
data$hours <- as.numeric(floor(difftime(timestamps(data), (timestamps(data)[1]), units="mins")))
But this only gives the difference from time zero onwards.
Here is example data with the 'mins' column that I want to add:
timestamps mins
2013-06-23 00:00:00 NA
2013-06-23 01:00:00 60
2013-06-23 02:00:00 60
2013-06-23 04:00:00 120

The code that you're using with the [1] is always referencing the first element of the timestamps vector.
To do what you want, take all but the first element minus all but the last element.
mytimes <- data.frame(timestamps = c("2013-06-23 00:00:00",
                                     "2013-06-23 01:00:00",
                                     "2013-06-23 02:00:00",
                                     "2013-06-23 04:00:00"),
                      mins = NA)
mytimes$mins <- c(NA, difftime(mytimes$timestamps[-1],
                               mytimes$timestamps[-nrow(mytimes)],
                               units = "mins"))
What this code does is:
Set up a data frame so that the timestamps and mins columns stay the same length.
Within that data frame, put the timestamps you have and the fact that you don't have any mins yet (i.e. NA).
Select all but the first element of timestamps mytimes$timestamps[-1]
Select all but the last element of timestamps mytimes$timestamps[-nrow(mytimes)]
Subtract them with difftime (since the strings are well-formatted, you don't first have to convert them to POSIXct objects), using units="mins" to get minutes.
Put an NA in front because you have one fewer difference than you have rows c(NA, ...)
Drop all of that back into the original data frame's mins column mytimes$mins <-
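If you prefer, the same result can be sketched with base R's diff(), assuming the rows are already ordered by time; diff() is shorthand for the all-but-first minus all-but-last subtraction above:
ts <- as.POSIXct(mytimes$timestamps)
d <- diff(ts)           # a difftime vector, one element shorter than ts
units(d) <- "mins"      # force the differences into minutes
mytimes$mins <- c(NA, as.numeric(d))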

Another option is to calculate it with this approach:
# create some data for an MWE
hrs <- c(0,1,2,4)
df <- data.frame(timestamps = as.POSIXct(paste("2015-12-17",
                                               paste(hrs, "00", "00", sep = ":"))))
df
# timestamps
# 1 2015-12-17 00:00:00
# 2 2015-12-17 01:00:00
# 3 2015-12-17 02:00:00
# 4 2015-12-17 04:00:00
# create a function that calculates the lag for n periods
lag <- function(x, n) c(rep(NA, n), x[1:(length(x) - n)])
# create a new column named mins; with this lag() the difference comes back
# in seconds, hence the division by 60
df$mins <- as.numeric(df$timestamps - lag(df$timestamps, 1)) / 60
df
# timestamps mins
# 1 2015-12-17 00:00:00 NA
# 2 2015-12-17 01:00:00 60
# 3 2015-12-17 02:00:00 60
# 4 2015-12-17 04:00:00 120
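If data.table is already in use, its shift() function does the same lagging without the hand-rolled helper; a minimal variant, assuming the df built above:
library(data.table)
setDT(df)
df[, mins := as.numeric(difftime(timestamps, shift(timestamps, 1L), units = "mins"))]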

Related

Thoughts on how to speed up replacing a column value when conditions between two objects (dataframe or datatable) are met in a loop?

I'm trying to get some input on how I might speed up a for loop that I've written. Essentially, I have a dataframe (DF1) where each row provides the latitude and longitude of a point at a given time. The variables therefore are the lat and long for the point and the timestamp (a date-time object). In addition, each row is a unique timestamp for a unique point (in other words, no repeats).
I'm trying to match it up to weather data contained in a netcdf file. The hourly weather data I have is a 3-dimensional file that includes the lats and longs for the grids, the time stamps, and the value for that weather variable. I'll call the weather variable u for now.
What I want to end up with: I have a column for the u values in DF1. It starts out with only missing values. In the end I want to replace those missing values in DF1 with the appropriate u value from the netcdf file.
How I've done this so far: I have constructed a for loop that can extract the appropriate u values for the nearest grid to each point in DF1. For example, for lat x and long y in DF1 I find the nearest grid in the netcdf file and extract the full timeseries. Then, within the for loop, I amend the DF1$u values with the appropriate data. DF1 is a rather large dataframe and the netcdf is even bigger (DF1 is just a small subset of the full netcdf).
Some pseudo data:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/5/2021 04:00:00"
, '2/1/2021 06:00:00'
, "1/7/2021 01:00:00"
, "2/2/2021 01:00:00"
, "2/5/2021 02:00:00")
lat <- c(34,36,41,50,20,40)
long <- c(55,50,-89,-175,-155,25)
DF1 <- data.frame(ID, datetime, lat, long)
DF1$u <- NA
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 NA
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
Here is an example of the type of for loop I've constructed, I've left out some of the more specific details that aren't relevant:
for(i in 1:nrow(DF1)) {
### a number of steps here that identify the closest grid to each point. ####
mat_index <- as.vector(arrayInd(which.min(dist_mat), dim(dist_mat)))
# Extract the time series for the lat and long that are closest. u is the weather variable, time is the datetime variable, and geometry is the corresponding lat long list item.
df_u <- data.frame(u=data_u[mat_index[2],mat_index[1],],time=timestamp,geometry=rep(psj[i,1],dim(data_u)[3]))
# To make things easier I separate geometry into a lat and a long col
df_u <- tidyr::separate(df_u, geometry, into = c("long", "lat"), sep = ",")
df_u$long <- gsub("c\\(", "", df_u$long)
df_u$lat <- gsub("\\)", "", df_u$lat)
# I find that datatables are a bit easier for these types of tasks, so I set the full timeseries data and DF1 to data table (in reality I set df1 as a data table outside of the for loop)
df_u <- setDT(df_u)
# Then I use merge to join the two datatables, replace the missing value with the appropriate value from df_u, and then drop all the unnecessary columns in the final step.
df_u_new <- merge(windu_df, df_u, by =c("lat","long","time"), all.x = T)
df_u_new[, u := ifelse(is.na(u.x), u.y, u.x)]
windu_df <- df_u_new[, .(time, lat, long, u)]
}
This works, but given the sheer size of the dataframe/datatables that I'm working with, I wonder if there is a faster way to do that last step in particular. I know merge is probably the slowest way to do this, but I kept running into issues using match() and inner_join.
Unfortunately I'm not able to give fully reproducible data here given that I'm working with a netcdf file, but df_u looks something like this for the first iteration:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/1/2021 02:00:00"
, "1/1/2021 03:00:00"
, "1/1/2021 04:00:00"
, "1/1/2021 05:00:00"
, "1/1/2021 06:00:00")
lat <- c(34,34,34,34,34,34)
long <- c(55,55,55,55,55,55)
u <- c(2.8,3.6,2.1,5.6,4.4,2.5)
df_u <- data.frame(ID, datetime, lat, long, u)
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/1/2021 02:00:00 34 55 3.6
3 3 1/1/2021 03:00:00 34 55 2.1
4 4 1/1/2021 04:00:00 34 55 5.6
5 5 1/1/2021 05:00:00 34 55 4.4
6 6 1/1/2021 06:00:00 34 55 2.5
Once u is amended in DF1 it should read:
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
In the next iteration, the for loop will extract the relevant weather data for the lat and long in row 2, retain the 2.8 already filled in for row 1, and replace the NA in row 2 with a value.
EDIT: The NETCDF data covers an entire year (so every hour for a year) for a decently large spatial area. It's ERA5 data. In addition, DF1 has thousands of unique lat/long and timestamp observations.
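One possible direction, sketched here rather than tested against the real netcdf data, is to replace the merge()/ifelse() step with a data.table update join, which fills u in place instead of building a merged copy on every iteration. It assumes windu_df and df_u carry the lat, long and time columns used in the merge() call above, with matching column types:
library(data.table)
setDT(windu_df)
setDT(df_u)
# join df_u onto windu_df by lat/long/time and fill only the missing u values;
# i.u refers to the u column coming from df_u
windu_df[df_u, on = c("lat", "long", "time"), u := ifelse(is.na(u), i.u, u)]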

Sorting a set of data for every 3 hours in r

I have two datasets at 10-minute resolution spanning 34 years. In one of them, observations are made only every 3 hours, and I would like to keep only the rows with those observations. It starts at midnight (included) and goes 3am, 6am, 9am, etc.
Looks like this:
stn CODES time1 pcp_type
1 SIO - 1981-01-01 02:00:00 <NA>
2 SIO - 1981-01-01 02:10:00 <NA>
3 SIO - 1981-01-01 02:20:00 <NA>
4 SIO - 1981-01-01 02:30:00 <NA>
5 SIO - 1981-01-01 02:40:00 <NA>
6 SIO - 1981-01-01 02:50:00 <NA>
Now the idea would be to keep only the lines which correspond to every 3 hours and delete the rest.
I saw some solutions about filtering by value (e.g. "is bigger than"), but I didn't find one that could help me filter by hour (%H == 3, etc.).
Thank you in advance.
I've already parsed my time column as follows:
SYNOP_SION$time1<-as.POSIXct(strptime(as.character(SYNOP_SION$time),format = "%Y%m%d%H%M"), tz="UTC")
Here is an example with a vector:
# Creating sample time data
time1 <- seq(from = Sys.time(), length.out = 96, by = "hours")
# To get a T/F vector you can use to filter
as.integer(format(time1, "%H")) %in% seq.int(0, 21, 3)
# To see the filtered POSIXct vector:
time1[as.integer(format(time1, "%H")) %in% seq.int(0, 21, 3)]
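Applied to the data frame in the question, a sketch might look like this; it assumes SYNOP_SION$time1 is the POSIXct column created above, and it also checks the minutes so that only the on-the-hour rows of the 10-minute data are kept:
keep <- as.integer(format(SYNOP_SION$time1, "%H")) %% 3 == 0 &
  format(SYNOP_SION$time1, "%M") == "00"
SYNOP_SION_3h <- SYNOP_SION[keep, ]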

Match given time with different time period then return the value in R

I have a data frame in R with 3 columns: start time, end time, and value
start_time end_time value
2017-01-03 00:00 2017-01-03 00:05 90
2017-01-03 00:05 2017-01-03 00:10 89
2017-01-03 00:10 2017-01-03 00:15 33
2017-01-03 00:15 2017-01-03 00:20 55
Another data frame is a list of times. For each time in the list, if the time is between a start time and an end time, return the value; if not, return NA. Is there a way I can do this using the foreach function?
You could do it using the base R cut.POSIXt() function. If your vector of times is e.g.:
library(lubridate)
df2 <- ymd_hm(c("2017-01-02 00:00", "2017-01-03 00:02", "2017-01-03 00:04", "2017-01-03 00:12")) # note that first one is not included
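# (a sketch of the question's data frame, so the example is runnable;
#  values taken from the table in the question)
df <- data.frame(
  start_time = ymd_hm(c("2017-01-03 00:00", "2017-01-03 00:05",
                        "2017-01-03 00:10", "2017-01-03 00:15")),
  end_time   = ymd_hm(c("2017-01-03 00:05", "2017-01-03 00:10",
                        "2017-01-03 00:15", "2017-01-03 00:20")),
  value      = c(90, 89, 33, 55)
)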
break2 <- unique(c(df$start_time, df$end_time))
cut.POSIXt(df2, breaks = break2, labels = c(df$value), right = TRUE)
[1] <NA> 90 90 33
NB:
In this case, the breaks are easily obtained from your data frame as your intervals are mutually exclusive.
In this example the intervals are closed on the right (right = TRUE).
In your case you don't define what should happen with values that fall on the boundary of two intervals, e.g. 2017-01-03 00:05.
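An alternative sketch uses lubridate intervals and %within%; note that, unlike right = TRUE above, %within% is inclusive at both ends, so a boundary time like 2017-01-03 00:05 would match the earlier interval here:
iv <- interval(df$start_time, df$end_time)    # one interval per row of df
vals <- vapply(seq_along(df2), function(i) {
  j <- which(df2[i] %within% iv)              # interval(s) containing this time
  if (length(j)) df$value[j[1]] else NA_real_
}, numeric(1))
vals
# [1] NA 90 90 33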

period.apply over an hour with deciding start time

So I have an xts time series over the year with time zone "UTC". The time interval between each row is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want data over one hour to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoint vector but somehow it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not. I don't want to change the names of my columns, I want to sum over a different interval.
Here a reproducible example:
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use this startpoints function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
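With the example data above, the aligned result should now have 25 rows, each summing a full hour that starts on the hour (a quick check; exact print formatting may differ):
agg_aligned
# the index runs 2015-01-01 00:00:00, 2015-01-01 01:00:00, ..., 2015-01-02 00:00:00
# and every value is 4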

R Search for a particular time from index

I use an xts object. The index of the object is as below. There is one for every hour of the day for a year.
"2011-01-02 18:59:00 EST"
"2011-01-02 19:58:00 EST"
"2011-01-02 20:59:00 EST"
The columns contain values associated with each index entry. What I want to do is calculate the standard deviation of the value for all Mondays at 18:59 for the complete year. There should be 52 values for the year.
I'm able to search for the day of the week using the weekdays() function, but my problem is searching for the time, such as 18:59:00 or any other time.
You can do this by using interaction to create a factor from the combination of weekdays and .indexhour, then use split to select the relevant observations from your xts object.
set.seed(21)
x <- .xts(rnorm(1e4), seq(1, by=60*60, length.out=1e4))
groups <- interaction(weekdays(index(x)), .indexhour(x))
output <- lapply(split(x, groups), function(x) c(count=length(x), sd=sd(x)))
output <- do.call(rbind, output)
head(output)
# count sd
# Friday.0 60 1.0301030
# Monday.0 59 0.9204670
# Saturday.0 60 0.9842125
# Sunday.0 60 0.9500347
# Thursday.0 60 0.9506620
# Tuesday.0 59 0.8972697
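For the specific group in the question (Mondays at a given hour), you can then pull the matching row out of the summary by its level name; this assumes an English locale, where weekdays() returns "Monday", and with the hourly example data the closest match to 18:59 is hour 18:
output["Monday.18", ]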
You can use the .index* family of functions (don't forget the '.' in front of 'index'!):
fxts[.indexmon(fxts)==0] # its zero-based (!) and gives you all the January values
fxts[.indexmday(fxts)==1] # beginning of month
fxts[.indexwday(fxts)==1] # Mondays
require(quantmod)
> fxts
value
2011-01-02 19:58:00 1
2011-01-02 20:59:00 2
2011-01-03 18:59:00 3
2011-01-09 19:58:00 4
2011-01-09 20:59:00 5
2011-01-10 18:59:00 6
2011-01-16 18:59:00 7
2011-01-16 19:58:00 8
2011-01-16 20:59:00 9
fxts[.indexwday(fxts)==1] #this gives you all the Mondays
For subsetting the time you use:
fxts["T19:30/T20:00"] # this will give you the time period you are looking for
And here you combine weekday and time period:
fxts["T18:30/T20:00"] & fxts[.indexwday(fxts)==1] # to get a logical vector or
fxts["T18:30/T21:00"][.indexwday(fxts["T18:30/T21:00"])==1] # to get the values
> value
2011-01-03 18:59:00 3
2011-01-10 18:59:00 6
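Putting the two pieces together for the original question, here is a sketch of the standard deviation of all Monday 18:59 values over the year, assuming the full-year hourly object is called fxts as above:
mon <- fxts[.indexwday(fxts) == 1]                       # keep Mondays only
mon_1859 <- mon[format(index(mon), "%H:%M") == "18:59"]  # keep the 18:59 rows
sd(as.numeric(mon_1859))                                 # roughly 52 values feed this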
