Convert sub-hourly data to hourly and round up time in R - r

I have a very big dataframe in R, containing weather data with the following format.
valid temp
1 17/08/2014 00:20 14
2 17/08/2014 00:50 14
3 17/08/2014 01:20 13.5
4 17/08/2014 01:50 13
5 17/08/2014 02:20 12
6 17/08/2014 02:50 10
I would like to convert these sub-hourly data to hourly, like the following.
valid tmpc
1 2014-08-17 00:00:00 14
2 2014-08-17 01:00:00 13.75
3 2014-08-17 02:00:00 12.5
The class of df$valid is 'factor'. I have tried first converting them to Date through POSIXct, but it gives only NA values. I have also tried changing the system locale and still I get NAs.

We can do this in base R by converting to POSIXlt, set the minute to 0, convert it back to POSIXct and aggregate to get the mean of 'temp'
df1$valid <- strptime(df1$valid, "%d/%m/%Y %H:%M")
df1$valid$min <- 0
df1$valid <- as.POSIXct(df1$valid)
aggregate(temp~valid, df1, FUN = mean)

Option 1: The lubridate solution using ceiling_date or round_date. It's not clear according to your data frame and results if what you want is to round or ceiling. For instance, in the first row you are rounding and in the third using ceiling. Anyways here the example:
library(lubridate)
df <- data.frame(i = 1, valid= "17/08/2014 01:28", temp = 14)
df$valid <- dmy_hm(df$valid)
df$valid_round <- ceiling_date(df$valid , unit="hours")
Results:
i valid temp valid_round
1 1 2014-08-17 01:28:00 14 2014-08-17 02:00:00
Option 2: using the base functions. Use:
df$valid <- as.POSIXct(strptime(df$valid, "%d/%m/%Y %H:%M", tz ="UTC"))
and then round it.

Related

period.apply over an hour with deciding start time

So I have a xts time serie over the year with time zone "UTC". The time interval between each row is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want data over one hour to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoint vector but somehow it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not. I don't want to change the names of my columns, I want to sum over a different interval.
Here a reproducible example:
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use this startpoints function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)

How to make dataframe with large number of row in R [duplicate]

I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have a 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below:
timestamp tr tt sr st
1 9/1/01 0:00 1.018269e+02 -312.8622 -1959.393 4959.828
2 9/1/01 0:01 1.023567e+02 -313.0002 -1957.755 4958.935
3 9/1/01 0:02 1.018857e+02 -313.9406 -1956.799 4959.938
4 9/1/01 0:03 1.025463e+02 -310.9261 -1957.347 4961.095
5 9/1/01 0:04 1.010228e+02 -311.5469 -1957.786 4959.078
The problem I have is that some timestamp values are missing - e.g. there may be a gap between 9/1/01 0:13 and 9/1/01 0:27 and such gaps are irregular through the data set. I need to put several of these series into the same database and because the missing values are different for each series, the dates do not currently align on each row.
I would like to generate rows for these missing timestamps and fill the Y columns with blank values (no data, not zero), so that I have a continuous time series.
I'm honestly not quite sure where to start (not really used R before so learning as I go along!) but any help would be much appreciated. I have thus far installed chron and zoo, since it seems they might be useful.
Thanks!
This is an old question, but I just wanted to post a dplyr way of handling this, as I came across this post while searching for an answer to a similar problem. I find it more intuitive and easier on the eyes than the zoo approach.
library(dplyr)
ts <- seq.POSIXt(as.POSIXct("2001-09-01 0:00",'%m/%d/%y %H:%M'), as.POSIXct("2001-09-01 0:07",'%m/%d/%y %H:%M'), by="min")
ts <- seq.POSIXt(as.POSIXlt("2001-09-01 0:00"), as.POSIXlt("2001-09-01 0:07"), by="min")
ts <- format.POSIXct(ts,'%m/%d/%y %H:%M')
df <- data.frame(timestamp=ts)
data_with_missing_times <- full_join(df,original_data)
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 NA NA NA NA
7 09/01/01 00:06 NA NA NA NA
8 09/01/01 00:07 NA NA NA NA
Also using dplyr, this makes it easier to do something like change all those missing values to something else, which came in handy for me when plotting in ggplot.
data_with_missing_times %>% group_by(timestamp) %>% mutate_each(funs(ifelse(is.na(.),0,.)))
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 0 0 0 0
7 09/01/01 00:06 0 0 0 0
8 09/01/01 00:07 0 0 0 0
I think the easiest thing ist to set Date first as already described, convert to zoo, and then just set a merge:
df$timestamp<-as.POSIXct(df$timestamp,format="%m/%d/%y %H:%M")
df1.zoo<-zoo(df[,-1],df[,1]) #set date to Index
df2 <- merge(df1.zoo,zoo(,seq(start(df1.zoo),end(df1.zoo),by="min")), all=TRUE)
Start and end are given from your df1 (original data) and you are setting by - e.g min - as you need for your example. all=TRUE sets all missing values at the missing dates to NAs.
Date padding is implemented in the padr package in R. If you store your data frame, with your date-time variable stored as POSIXct or POSIXlt. All you need to do is:
library(padr)
pad(df_name)
See vignette("padr") or this blog post for its working.
I think this can accomplished by using complete in tidyr package.
library(tidyverse)
df <- df %>%
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "minute"),
tr, tt, sr,st)
you can also initialize your start date and end date instead of using min(timestamp) and max(timestamp).
# some made-up data
originaldf <- data.frame(timestamp=c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:04"),
tr = rnorm(4,0,1),
tt = rnorm(4,0,1))
originaldf$minAsPOSIX <- as.POSIXct(originaldf$timestamp, format="%m/%d/%y %H:%M", tz="GMT")
# Generate vector of all minutes
ndays <- 1 # number of days to generate
minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60)
# convert those minutes to POSIX
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="GMT")
# new df
newdf <- merge(data.frame(minAsPOSIX),originaldf,all.x=TRUE, by="minAsPOSIX")
In case you want to substitute the NA values acquired by any method mentioned above with zeroes, you can do this:
df[is.na(df)] <- 0
(I orginally wanted to comment this on Ibollar's answer but I lack the necessary reputation, thus I posted as an answer)
df1.zoo <- zoo(df1[,-1], as.POSIXlt(df1[,1], format = "%Y-%m-%d %H:%M:%S")) #set date to Index: Notice that column 1 is Timestamp type and is named as "TS"
full.frame.zoo <- zoo(NA, seq(start(df1.zoo), end(df1.zoo), by="min")) # zoo object
full.frame.df <- data.frame(TS = as.POSIXlt(index(full.frame.zoo), format = "%Y-%m-%d %H:%M:%S")) # conver zoo object to data frame
full.vancouver <- merge(full.frame.df, df1, all = TRUE) # merge
I was looking for something similar where instead of filling out missing timestamps my data was in months and days. So I wanted to generate a sequence of months that would cater for leap years et cetera. I used lubridate:
date <- df$timestamp[1]
date_list <- c(date)
while (date < df$timestamp[nrow(df)]){
date <- date %m+% months(1)
date_list <- c(date_list,date)
}
date_list <- format(as.Date(date_list),"%Y-%m-%d")
df_1 <- data.frame(months=date_list, stringsAsFactors = F)
This will give me a list of dates in incremental months. Then I join
df_with_missing_months <- full_join(df_1,df)
There are some advances in handling time series data in R, e.g. the tsibble package added such time series manipulations in tidy way:
library(tsibble)
library(lubridate)
ts <- lubridate::dmy_hm(c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:27"))
originaldf <- tsibble(timestamp = ts,
tr = rnorm(4,0,1),
tt = rnorm(4,0,1),
index = timestamp)
originaldf %>%
fill_gaps()

Calulcate running difference of time using difftime on one column of timestamps

How would you calculate time difference of two consecutive rows of timestamps in minutes and add the result to a new column.
I have tried this:
data$hours <- as.numeric(floor(difftime(timestamps(data), (timestamps(data)[1]), units="mins")))
But only get difference from time zero and onwards.
Added example data with 'mins' column that I want to be added
timestamps mins
2013-06-23 00:00:00 NA
2013-06-23 01:00:00 60
2013-06-23 02:00:00 60
2013-06-23 04:00:00 120
The code that you're using with the [1] is always referencing the first element of the timestamps vector.
To do what you want, you want to look at all but the first element minus all but the last element.
mytimes <- data.frame(timestamps=c("2013-06-23 00:00:00",
"2013-06-23 01:00:00",
"2013-06-23 02:00:00",
"2013-06-23 04:00:00"),
mins=NA)
mytimes$mins <- c(NA, difftime(mytimes$timestamps[-1],
mytimes$timestamps[-nrow(mytimes)],
units="mins"))
What this code does is:
Setup a data frame so that you will keep the length of the timestamps and mins the same.
Within that data frame, put the timestamps you have and the fact that you don't have any mins yet (i.e. NA).
Select all but the first element of timestamps mytimes$timestamps[-1]
Select all but the last element of timestamps mytimes$timestamps[-nrow(mytimes)]
Subtract them difftime (since they're well-formatted, you don't first have to make them POSIXct objects) with the units of minutes. units="mins"
Put an NA in front because you have one fewer difference than you have rows c(NA, ...)
Drop all of that back into the original data frame's mins column mytimes$mins <-
Another option is to calculate it with this approach:
# create some data for an MWE
hrs <- c(0,1,2,4)
df <- data.frame(timestamps = as.POSIXct(paste("2015-12-17",
paste(hrs, "00", "00", sep = ":"))))
df
# timestamps
# 1 2015-12-17 00:00:00
# 2 2015-12-17 01:00:00
# 3 2015-12-17 02:00:00
# 4 2015-12-17 04:00:00
# create a function that calculates the lag for n periods
lag <- function(x, n) c(rep(NA, n), x[1:(length(x) - n)])
# create a new column named mins
df$mins <- as.numeric(df$timestamps - lag(df$timestamps, 1)) / 60
df
# timestamps mins
# 1 2015-12-17 00:00:00 NA
# 2 2015-12-17 01:00:00 60
# 3 2015-12-17 02:00:00 60
# 4 2015-12-17 04:00:00 120

Combining date and time into a Date column for plotting

I want to create a line plot. I have 3 columns in my data frame:
date time numbers
01-02-2010 14:57 5
01-02-2010 23:23 7
02-02-2010 05:05 3
02-02-2010 10:23 11
How can I combine the first two columns and make a plot based on date and time ?
Date is Date class, time is just a char variable.
The lubridate package is another option. It handles most of the fussy formatting details, so it can be easier to use than base R date functions. For example, in your case, mdy_hm (month-day-year_hour-minute) will convert your date and time variables into a single POSIXct date-time column. (If you meant it to be day-month-year, rather than month-day-year, then just use dmy_hm.) See code below.
library(lubridate)
dat$date_time = mdy_hm(paste(dat$date, dat$time))
dat
date time numbers date_time
1 01-02-2010 14:57 5 2010-01-02 14:57:00
2 01-02-2010 23:23 7 2010-01-02 23:23:00
3 02-02-2010 05:05 3 2010-02-02 05:05:00
4 02-02-2010 10:23 11 2010-02-02 10:23:00
library(ggplot2)
ggplot(dat, aes(date_time, numbers)) +
geom_point() + geom_line() +
scale_x_datetime(breaks=date_breaks("1 week"),
minor_breaks=date_breaks("1 day"))
Reconstruct your data:
dat <- read.table(text="
date time numbers
01-02-2010 14:57 5
01-02-2010 23:23 7
02-02-2010 05:05 3
02-02-2010 10:23 11", header=TRUE)
Now use as.POSIXct() and paste() to combine your date and time into a POSIX date. You need to specify the format, using the symbols defined in ?strptime. Also see ?DateTimeClasses for more information
dat$newdate <- with(dat, as.POSIXct(paste(date, time), format="%m-%d-%Y %H:%M"))
plot(numbers ~ newdate, data=dat, type="b", col="blue")

Insert rows for missing dates/times

I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have a 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below:
timestamp tr tt sr st
1 9/1/01 0:00 1.018269e+02 -312.8622 -1959.393 4959.828
2 9/1/01 0:01 1.023567e+02 -313.0002 -1957.755 4958.935
3 9/1/01 0:02 1.018857e+02 -313.9406 -1956.799 4959.938
4 9/1/01 0:03 1.025463e+02 -310.9261 -1957.347 4961.095
5 9/1/01 0:04 1.010228e+02 -311.5469 -1957.786 4959.078
The problem I have is that some timestamp values are missing - e.g. there may be a gap between 9/1/01 0:13 and 9/1/01 0:27 and such gaps are irregular through the data set. I need to put several of these series into the same database and because the missing values are different for each series, the dates do not currently align on each row.
I would like to generate rows for these missing timestamps and fill the Y columns with blank values (no data, not zero), so that I have a continuous time series.
I'm honestly not quite sure where to start (not really used R before so learning as I go along!) but any help would be much appreciated. I have thus far installed chron and zoo, since it seems they might be useful.
Thanks!
This is an old question, but I just wanted to post a dplyr way of handling this, as I came across this post while searching for an answer to a similar problem. I find it more intuitive and easier on the eyes than the zoo approach.
library(dplyr)
ts <- seq.POSIXt(as.POSIXct("2001-09-01 0:00",'%m/%d/%y %H:%M'), as.POSIXct("2001-09-01 0:07",'%m/%d/%y %H:%M'), by="min")
ts <- seq.POSIXt(as.POSIXlt("2001-09-01 0:00"), as.POSIXlt("2001-09-01 0:07"), by="min")
ts <- format.POSIXct(ts,'%m/%d/%y %H:%M')
df <- data.frame(timestamp=ts)
data_with_missing_times <- full_join(df,original_data)
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 NA NA NA NA
7 09/01/01 00:06 NA NA NA NA
8 09/01/01 00:07 NA NA NA NA
Also using dplyr, this makes it easier to do something like change all those missing values to something else, which came in handy for me when plotting in ggplot.
data_with_missing_times %>% group_by(timestamp) %>% mutate_each(funs(ifelse(is.na(.),0,.)))
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 0 0 0 0
7 09/01/01 00:06 0 0 0 0
8 09/01/01 00:07 0 0 0 0
I think the easiest thing ist to set Date first as already described, convert to zoo, and then just set a merge:
df$timestamp<-as.POSIXct(df$timestamp,format="%m/%d/%y %H:%M")
df1.zoo<-zoo(df[,-1],df[,1]) #set date to Index
df2 <- merge(df1.zoo,zoo(,seq(start(df1.zoo),end(df1.zoo),by="min")), all=TRUE)
Start and end are given from your df1 (original data) and you are setting by - e.g min - as you need for your example. all=TRUE sets all missing values at the missing dates to NAs.
Date padding is implemented in the padr package in R. If you store your data frame, with your date-time variable stored as POSIXct or POSIXlt. All you need to do is:
library(padr)
pad(df_name)
See vignette("padr") or this blog post for its working.
I think this can accomplished by using complete in tidyr package.
library(tidyverse)
df <- df %>%
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "minute"),
tr, tt, sr,st)
you can also initialize your start date and end date instead of using min(timestamp) and max(timestamp).
# some made-up data
originaldf <- data.frame(timestamp=c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:04"),
tr = rnorm(4,0,1),
tt = rnorm(4,0,1))
originaldf$minAsPOSIX <- as.POSIXct(originaldf$timestamp, format="%m/%d/%y %H:%M", tz="GMT")
# Generate vector of all minutes
ndays <- 1 # number of days to generate
minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60)
# convert those minutes to POSIX
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="GMT")
# new df
newdf <- merge(data.frame(minAsPOSIX),originaldf,all.x=TRUE, by="minAsPOSIX")
In case you want to substitute the NA values acquired by any method mentioned above with zeroes, you can do this:
df[is.na(df)] <- 0
(I orginally wanted to comment this on Ibollar's answer but I lack the necessary reputation, thus I posted as an answer)
df1.zoo <- zoo(df1[,-1], as.POSIXlt(df1[,1], format = "%Y-%m-%d %H:%M:%S")) #set date to Index: Notice that column 1 is Timestamp type and is named as "TS"
full.frame.zoo <- zoo(NA, seq(start(df1.zoo), end(df1.zoo), by="min")) # zoo object
full.frame.df <- data.frame(TS = as.POSIXlt(index(full.frame.zoo), format = "%Y-%m-%d %H:%M:%S")) # conver zoo object to data frame
full.vancouver <- merge(full.frame.df, df1, all = TRUE) # merge
I was looking for something similar where instead of filling out missing timestamps my data was in months and days. So I wanted to generate a sequence of months that would cater for leap years et cetera. I used lubridate:
date <- df$timestamp[1]
date_list <- c(date)
while (date < df$timestamp[nrow(df)]){
date <- date %m+% months(1)
date_list <- c(date_list,date)
}
date_list <- format(as.Date(date_list),"%Y-%m-%d")
df_1 <- data.frame(months=date_list, stringsAsFactors = F)
This will give me a list of dates in incremental months. Then I join
df_with_missing_months <- full_join(df_1,df)
There are some advances in handling time series data in R, e.g. the tsibble package added such time series manipulations in tidy way:
library(tsibble)
library(lubridate)
ts <- lubridate::dmy_hm(c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:27"))
originaldf <- tsibble(timestamp = ts,
tr = rnorm(4,0,1),
tt = rnorm(4,0,1),
index = timestamp)
originaldf %>%
fill_gaps()

Resources