Converting irregular timestamp data to regularly spaced data using R

In a database I have data with associated timestamps. The timestamps are irregularly spaced and have a resolution of up to one minute. I want to make this data uniform using R with respect to timestamps (with seconds resolution), with NA replaced by the previous value. Also, every timestamp should contain data for all the symbols. I have tried some time series packages for making the data uniform but have not been successful.
This is the code I have run so far:
library("RPostgreSQL")
library(DBI)
library(sqldf)
drv <- dbDriver("PostgreSQL")
ch <- dbConnect(drv, dbname="derivativesData",
user="postgres", password="postgres")
companyFrame <- dbGetQuery(ch, "select * from derData")
companyFrame$trade_time
[1] "2011-06-01 09:00:00 IST" "2011-06-01 09:00:00 IST"
[3] "2011-06-01 09:00:00 IST" "2011-06-01 09:00:00 IST"
[5] "2011-06-01 09:00:00 IST" "2011-06-01 09:00:00 IST"
[7] "2011-06-01 09:00:00 IST" "2011-06-01 09:00:00 IST"
[9] "2011-06-01 09:00:00 IST" "2011-06-01 09:01:00 IST"
[11] "2011-06-01 09:01:00 IST" "2011-06-01 09:01:00 IST"
[13] "2011-06-01 09:02:00 IST" "2011-06-01 09:02:00 IST"
[15] "2011-06-01 09:02:00 IST" "2011-06-01 09:03:00 IST"
[17] "2011-06-01 09:04:00 IST" "2011-06-01 09:04:00 IST"
[19] "2011-06-01 09:05:00 IST" "2011-06-01 09:05:00 IST"
[21] "2011-06-01 09:06:00 IST" "2011-06-01 09:06:00 IST"
[23] "2011-06-01 09:06:00 IST" "2011-06-01 09:07:00 IST"
[25] "2011-06-01 09:08:00 IST" "2011-06-01 09:09:00 IST"
[27] "2011-06-01 09:10:00 IST" "2011-06-01 09:10:00 IST"
I want to convert this data into a uniform format with, say, a 10-second resolution.

Here I will use a 10-minute resolution, as your times don't have any seconds...
With the following sample data :
R> time <- c("2011-06-01 09:00:00 IST", "2011-06-01 09:00:00 IST", "2011-06-01 09:01:00 IST",
+ "2011-06-01 09:06:00 IST", "2011-06-01 09:10:00 IST", "2011-06-01 09:15:00 IST")
You can first convert the strings to a POSIXlt date format :
R> time2 <- strptime(time, format="%Y-%m-%d %X")
R> time2
[1] "2011-06-01 09:00:00" "2011-06-01 09:00:00" "2011-06-01 09:01:00"
[4] "2011-06-01 09:06:00" "2011-06-01 09:10:00" "2011-06-01 09:15:00"
Then you could use the minute function from the lubridate package to alter the minute component of your dates and round them down to a 10-minute resolution, for example :
R> library(lubridate)
R> minute(time2) <- minute(time2) %/% 10 * 10
R> time2
[1] "2011-06-01 09:00:00 CEST" "2011-06-01 09:00:00 CEST"
[3] "2011-06-01 09:00:00 CEST" "2011-06-01 09:00:00 CEST"
[5] "2011-06-01 09:10:00 CEST" "2011-06-01 09:10:00 CEST"
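As an alternative sketch, lubridate's floor_date() can do the same rounding in one call on POSIXct values. The three sample times and the "UTC" time zone below are assumptions made for illustration, not the asker's data.

```r
library(lubridate)

# Hypothetical sample times; tz = "UTC" is an assumption
times <- as.POSIXct(c("2011-06-01 09:00:00",
                      "2011-06-01 09:06:00",
                      "2011-06-01 09:15:00"), tz = "UTC")

# Round each timestamp down to the start of its 10-minute bucket
floored <- floor_date(times, unit = "10 minutes")
format(floored, "%H:%M")
```

Unlike assigning to minute(), this leaves the original vector untouched and also zeroes any seconds.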

Try the data.table package and its roll=TRUE feature. See ?data.table and the vignettes where it talks about fast last observation carried forward.
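A minimal sketch of that rolling join, under stated assumptions: the column names (symbol, trade_time, price) and the sample rows are hypothetical, since the question does not show the table's columns. The idea is to build a regular 10-second grid for every symbol and let roll = TRUE carry the last observation forward onto it.

```r
library(data.table)

# Hypothetical irregular trades for two symbols (names and values are assumptions)
trades <- data.table(
  symbol     = c("A", "A", "B", "A", "B"),
  trade_time = as.POSIXct(c("2011-06-01 09:00:00", "2011-06-01 09:02:00",
                            "2011-06-01 09:00:00", "2011-06-01 09:03:00",
                            "2011-06-01 09:01:00"), tz = "UTC"),
  price      = c(100, 101, 50, 102, 51)
)

# Regular 10-second grid crossed with every symbol
grid <- CJ(
  symbol     = unique(trades$symbol),
  trade_time = seq(min(trades$trade_time), max(trades$trade_time), by = "10 sec")
)

# Rolling join: each grid row takes the latest trade at or before its time
regular <- trades[grid, on = .(symbol, trade_time), roll = TRUE]
head(regular)
```

Grid rows that fall before a symbol's first trade would come back as NA, because there is nothing earlier to carry forward.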

Related

remove time from POSIXct Date

I have converted my date from chr to POSIXct using the code below.
crime2$Date = parse_date_time(crime2$Date, orders = c('dmy_HM'),tz="UTC")
So my dates are now in this format.
> head(crime2$Date, 10)
[1] "2015-03-18 19:44:00 UTC" "2015-03-18 22:45:00 UTC"
[3] "2015-03-18 22:30:00 UTC" "2015-03-18 22:00:00 UTC"
[5] "2015-03-18 23:00:00 UTC" "2015-03-18 21:35:00 UTC"
[7] "2015-03-18 22:50:00 UTC" "2015-03-18 23:40:00 UTC"
[9] "2015-03-18 23:30:00 UTC" "2015-03-18 22:45:00 UTC"
However, if I want to remove the time and keep only the date, how can I do that?
For example, they should look like this:
" 2015-03-18 " "2015-03-18 "

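One possible approach, sketched here in base R with two values taken from the output above: as.Date() drops the time and returns a Date, while format() returns plain character strings. Passing tz = "UTC" to as.Date() is an assumption that matches the time zone shown in the question.

```r
# Two sample values from the question, parsed as POSIXct in UTC
x <- as.POSIXct(c("2015-03-18 19:44:00", "2015-03-18 22:45:00"), tz = "UTC")

# Keep the date only, as a Date object
d <- as.Date(x, tz = "UTC")

# Or keep the date only, as character strings
s <- format(x, "%Y-%m-%d")
```

The Date version is usually the better choice if the values will be sorted, compared, or used in date arithmetic later.
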
How do I format 12 hour data with am/pm without NAs?

The reformatting works for just the date, but when I add in the am/pm data it doesn't work. Here is the code and data:
str(Hourly_Steps$activity_hour)
chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" ...
This worked to format just the date...
> day <- strptime(Hourly_Steps$activity_hour, format = "%m/%d/%Y %I:%M:%S")
> day
[1] "2016-04-12 00:00:00 IST" "2016-04-12 01:00:00 IST"
[3] "2016-04-12 02:00:00 IST" "2016-04-12 03:00:00 IST"
[5] "2016-04-12 04:00:00 IST" "2016-04-12 05:00:00 IST"
[7] "2016-04-12 06:00:00 IST" "2016-04-12 07:00:00 IST"
[9] "2016-04-12 08:00:00 IST" "2016-04-12 09:00:00 IST"
[11] "2016-04-12 10:00:00 IST" "2016-04-12 11:00:00 IST"
But when I try to add in the %p for am/pm info, it goes back to 24hr...
> day <- strptime(Hourly_Steps$activity_hour, format = "%m/%d/%Y %I:%M:%S %p")
> day
[1] "2016-04-12 00:00:00 IST" "2016-04-12 01:00:00 IST"
[3] "2016-04-12 02:00:00 IST" "2016-04-12 03:00:00 IST"
[5] "2016-04-12 04:00:00 IST" "2016-04-12 05:00:00 IST"
[7] "2016-04-12 06:00:00 IST" "2016-04-12 07:00:00 IST"
[9] "2016-04-12 08:00:00 IST" "2016-04-12 09:00:00 IST"
[11] "2016-04-12 10:00:00 IST" "2016-04-12 11:00:00 IST"
[13] "2016-04-12 12:00:00 IST" "2016-04-12 13:00:00 IST"
[15] "2016-04-12 14:00:00 IST" "2016-04-12 15:00:00 IST"
What am I missing here?
Your code works correctly.
You are starting with a "character" vector, and strptime() correctly reads your 12-hour times into the "POSIXt" date-time format. A "POSIXt" object is always printed in the "YYYY-MM-DD HH:MM:SS TZ" format, using a 24-hour clock, and this is what you see displayed.
x <- c("4/12/2016 1:00:00 AM", "4/12/2016 1:00:00 PM", "4/12/2016 12:00:00 AM",
"4/12/2016 12:00:00 PM")
y <- strptime(x, '%m/%d/%Y %I:%M:%S %p')
y
# [1] "2016-04-12 01:00:00 CEST" "2016-04-12 13:00:00 CEST"
# [3] "2016-04-12 00:00:00 CEST" "2016-04-12 12:00:00 CEST"
and
class(y)
# [1] "POSIXlt" "POSIXt"
Maybe you are looking for a way to change your output format, where you'd want to use strftime().
z <- strftime(y, '%m/%d/%Y %I:%M:%S %p')
z
# [1] "04/12/2016 01:00:00 am" "04/12/2016 01:00:00 pm"
# [3] "04/12/2016 12:00:00 am" "04/12/2016 12:00:00 pm"
Note, however, that you now have a "character" class.
class(z)
# [1] "character"
So, altogether you may want:
strftime(strptime(x, '%m/%d/%Y %I:%M:%S %p'), '%d/%m/%Y %I:%M:%S %p')
# [1] "12/04/2016 01:00:00 am" "12/04/2016 01:00:00 pm"
# [3] "12/04/2016 12:00:00 am" "12/04/2016 12:00:00 pm"
Sidenote: Midnight may not be displayed even though it is stored internally.
strptime("4/12/2016 12:00:00 AM", '%m/%d/%Y %I:%M:%S %p')
# [1] "2016-04-12 CEST"
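If the midnight time should always be visible, one option is to print with an explicit format string. This is a small sketch, assuming the "C" locale for AM/PM parsing and "UTC" as the time zone; neither comes from the question.

```r
# Use the C locale so that "AM"/"PM" are recognised by %p (an assumption)
invisible(Sys.setlocale("LC_TIME", "C"))

x <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")
y <- as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# An explicit format always prints the time part, including 00:00:00
shown <- format(y, "%Y-%m-%d %H:%M:%S")
shown
```

The stored values are unchanged; only the printed representation differs.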

lubridate::floor_date set reference start timestamp

I'm trying to floor continuous timestamps to 'every x hours' with lubridate::floor_date. However, when my time interval is greater than the hour of the first timestamp, it floors relative to midnight instead of my first timestamp. I have not found a way to set a reference timestamp for my start time. I have timestamps in UTC but need to floor them relative to, for example, 6:00 and 18:00 local time. These would be 12-hour intervals when referenced to local midnight, but this doesn't work for UTC timestamps because floor_date keeps referencing (UTC) midnight.
I know I could convert my timestamps to local time, but that is less than ideal. Is there a way to define the reference timestamps for floor_date that I'm missing?
Basically, what I'd like to do is floor the timestamps "every hour" relative to the start of my timeseries instead of each timestamp individually flooring relative to its midnight.
timestamps<-structure(c(1578628800, 1578632400, 1578636000, 1578639600, 1578643200,
1578646800, 1578650400, 1578654000, 1578657600, 1578661200), class = c("POSIXct",
"POSIXt"), tzone = "UTC")
floor_date(timestamps, '4 hours')
[1] "2020-01-10 04:00:00 UTC" "2020-01-10 04:00:00 UTC" "2020-01-10 04:00:00 UTC"
[4] "2020-01-10 04:00:00 UTC" "2020-01-10 08:00:00 UTC" "2020-01-10 08:00:00 UTC"
[7] "2020-01-10 08:00:00 UTC" "2020-01-10 08:00:00 UTC" "2020-01-10 12:00:00 UTC"
[10] "2020-01-10 12:00:00 UTC"
floor_date(timestamps, '5 hours')
[1] "2020-01-10 00:00:00 UTC" "2020-01-10 05:00:00 UTC" "2020-01-10 05:00:00 UTC"
[4] "2020-01-10 05:00:00 UTC" "2020-01-10 05:00:00 UTC" "2020-01-10 05:00:00 UTC"
[7] "2020-01-10 10:00:00 UTC" "2020-01-10 10:00:00 UTC" "2020-01-10 10:00:00 UTC"
[10] "2020-01-10 10:00:00 UTC"
Try the clock package:
clock::date_floor(timestamps, 'hour', n = 4)
[1] "2020-01-10 04:00:00 UTC" "2020-01-10 04:00:00 UTC"
[3] "2020-01-10 04:00:00 UTC" "2020-01-10 04:00:00 UTC"
[5] "2020-01-10 08:00:00 UTC" "2020-01-10 08:00:00 UTC"
[7] "2020-01-10 08:00:00 UTC" "2020-01-10 08:00:00 UTC"
[9] "2020-01-10 12:00:00 UTC" "2020-01-10 12:00:00 UTC"
clock::date_floor(timestamps, 'hour', n = 5)
[1] "2020-01-10 01:00:00 UTC" "2020-01-10 01:00:00 UTC"
[3] "2020-01-10 06:00:00 UTC" "2020-01-10 06:00:00 UTC"
[5] "2020-01-10 06:00:00 UTC" "2020-01-10 06:00:00 UTC"
[7] "2020-01-10 06:00:00 UTC" "2020-01-10 11:00:00 UTC"
[9] "2020-01-10 11:00:00 UTC" "2020-01-10 11:00:00 UTC"
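For flooring relative to the first timestamp, a plain base R sketch is shown here. floor_from() is a hypothetical helper written for this illustration, not part of clock or lubridate; it measures each timestamp's offset from a chosen origin and rounds that offset down to a multiple of the interval width.

```r
# Timestamps copied from the question (04:00 to 13:00 UTC on 2020-01-10)
timestamps <- structure(c(1578628800, 1578632400, 1578636000, 1578639600,
                          1578643200, 1578646800, 1578650400, 1578654000,
                          1578657600, 1578661200),
                        class = c("POSIXct", "POSIXt"), tzone = "UTC")

# Hypothetical helper: floor x to multiples of width_secs counted from origin
floor_from <- function(x, origin, width_secs) {
  offset <- as.numeric(difftime(x, origin, units = "secs"))
  origin + floor(offset / width_secs) * width_secs
}

# 5-hour buckets that start at the first timestamp instead of midnight
floored <- floor_from(timestamps, origin = timestamps[1], width_secs = 5 * 3600)
format(floored, "%H:%M")
```

Because the arithmetic is done on elapsed seconds, the origin can be any instant, for example 06:00 local time expressed in UTC.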

R - How to calculate in a new column the difference in seconds between the first and the remaining dates

I have the following dates and I want to calculate the difference between the first date and each of the other dates, i.e. date 2 - date 1, date 3 - date 1, etc., in seconds and in another column.
Any help is appreciated; I am new to R.
"2009-06-01 16:00:00 UTC"
"2009-06-29 16:00:00 UTC"
"2009-06-29 17:00:00 UTC"
"2009-06-30 16:00:00 UTC"
"2009-06-30 17:00:00 UTC"
"2009-06-30 18:00:00 UTC"
"2009-06-30 19:00:00 UTC"
"2009-07-01 08:00:00 UTC"
"2009-07-01 09:00:00 UTC"
"2009-07-01 10:00:00 UTC"
"2009-07-01 16:00:00 UTC"
"2009-07-01 17:00:00 UTC"
"2009-07-01 18:00:00 UTC"
"2009-07-01 19:00:00 UTC"
"2009-07-02 08:00:00 UTC"
"2009-07-02 09:00:00 UTC"
"2009-07-02 10:00:00 UTC"
"2009-07-02 16:00:00 UTC"
"2009-07-02 17:00:00 UTC"
"2009-07-02 18:00:00 UTC"
"2009-07-02 19:00:00 UTC"
"2009-07-04 10:00:00 UTC"
"2009-07-04 16:00:00 UTC"
"2009-07-04 17:00:00 UTC"
"2010-06-22 16:00:00 UTC"
"2010-06-22 17:00:00 UTC"
"2010-06-22 18:00:00 UTC"
"2010-08-20 16:00:00 UTC"
"2011-06-02 16:00:00 UTC"
"2011-06-02 17:00:00 UTC"
"2011-06-02 18:00:00 UTC"
"2011-06-03 10:00:00 UTC"
"2011-06-03 16:00:00 UTC"
"2011-06-03 17:00:00 UTC"
"2011-06-03 18:00:00 UTC"
"2011-06-03 19:00:00 UTC"
First you'll want to convert your character strings to dates. Once you've done this, you can easily use difftime() to calculate time distances.
There are a number of packages that help you with this and even more ways to do so. So in addition to the answer provided using the lubridate package, here is a way to solve it in base R:
# (I'll assume your data is saved in a character vector called my_dates)
my_dates <- gsub(" UTC", "", my_dates) # removes " UTC" from all your dates (not strictly needed, see edit below)
my_dates <- as.POSIXlt(my_dates, tz = "UTC") # converts to a date-time class
difftime(time1 = my_dates, time2 = my_dates[1], units = "secs")
# Time differences in secs
# [1] 0 2419200 2422800 2505600 2509200 2512800 2516400 2563200 2566800 2570400 2592000 2595600
# [13] 2599200 2602800 2649600 2653200 2656800 2678400 2682000 2685600 2689200 2829600 2851200 2854800
# [25] 33350400 33354000 33357600 38448000 63158400 63162000 63165600 63223200 63244800 63248400 63252000 63255600
Note: In my initial answer, I used as.Date.character(), but this ignored the times after the dates! as.Date() also ignores the time and only focuses on the dates. as.POSIXlt() does the job and keeps both the times and the dates.
Edit from comment: Apparently difftime() is clever enough to recognise strings as dates and automatically gets the right format for the dates, too:
difftime(my_dates, my_dates[1], units = "secs")
# Time differences in secs
# [1] 0 2419200 2422800 2505600 2509200 2512800 2516400 2563200 2566800 2570400 2592000 2595600
# [13] 2599200 2602800 2649600 2653200 2656800 2678400 2682000 2685600 2689200 2829600 2851200 2854800
# [25] 33350400 33354000 33357600 38448000 63158400 63162000 63165600 63223200 63244800 63248400 63252000 63255600
The lubridate package is your friend in this scenario:
library(lubridate)
d <- read.table(text='"2009-06-01 16:00:00 UTC"
"2009-06-29 16:00:00 UTC"
"2009-06-29 17:00:00 UTC"
"2009-06-30 16:00:00 UTC"
"2009-06-30 17:00:00 UTC"
"2009-06-30 18:00:00 UTC"
"2009-06-30 19:00:00 UTC"
"2009-07-01 08:00:00 UTC"
"2009-07-01 09:00:00 UTC"
"2009-07-01 10:00:00 UTC"
"2009-07-01 16:00:00 UTC"
"2009-07-01 17:00:00 UTC"
"2009-07-01 18:00:00 UTC"
"2009-07-01 19:00:00 UTC"
"2009-07-02 08:00:00 UTC"
"2009-07-02 09:00:00 UTC"
"2009-07-02 10:00:00 UTC"
"2009-07-02 16:00:00 UTC"
"2009-07-02 17:00:00 UTC"
"2009-07-02 18:00:00 UTC"
"2009-07-02 19:00:00 UTC"
"2009-07-04 10:00:00 UTC"
"2009-07-04 16:00:00 UTC"
"2009-07-04 17:00:00 UTC"
"2010-06-22 16:00:00 UTC"
"2010-06-22 17:00:00 UTC"
"2010-06-22 18:00:00 UTC"
"2010-08-20 16:00:00 UTC"
"2011-06-02 16:00:00 UTC"
"2011-06-02 17:00:00 UTC"
"2011-06-02 18:00:00 UTC"
"2011-06-03 10:00:00 UTC"
"2011-06-03 16:00:00 UTC"
"2011-06-03 17:00:00 UTC"
"2011-06-03 18:00:00 UTC"
"2011-06-03 19:00:00 UTC"', stringsAsFactors=FALSE)
d <- ymd_hms(d[, 1])
difftime(d, d[1], units = "secs") # seconds elapsed since the first timestamp
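Since the question asks for the result in another column, here is a short base R sketch that stores the differences in a data frame. The data frame df and the column names date and secs_since_first are assumptions made for illustration; only the first three dates from the question are used.

```r
# Hypothetical data frame holding the first three dates from the question
df <- data.frame(
  date = c("2009-06-01 16:00:00 UTC",
           "2009-06-29 16:00:00 UTC",
           "2009-06-29 17:00:00 UTC"),
  stringsAsFactors = FALSE
)

# Parse as UTC; the trailing " UTC" text is ignored by the format string
df$date <- as.POSIXct(df$date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# New column: seconds elapsed since the first row
df$secs_since_first <- as.numeric(difftime(df$date, df$date[1], units = "secs"))
df
```

Wrapping difftime() in as.numeric() stores plain numbers, which is often easier to work with than a difftime column.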

Generate a working day sequence in R

I want to generate a working week / working day sequence (Monday-Friday; 8am - 5pm) in R. However I only figured out how to extract a working week (Monday-Friday) with 24 hours.
library(timeDate)
start <- as.POSIXct("2010-01-01")
interval <- 60
seq_1 <- as.timeDate(seq(from=start, by=interval*60, length.out = 200))
seq_2 <- seq_1[isWeekday(seq_1)]; seq_2
dayOfWeek(seq_2)
Is there a similar function which can extract only working hours? Thanks
You can use format() to extract the hour and keep only the working hours. Hours 8 through 16 cover 08:00 up to, but not including, 17:00:
seq_2[as.numeric(format(seq_2, '%H')) %in% 8:16]
Select weekdays and then repeat with frequency equal to the desired hours. I'm afraid I missed your 8 o'clock start and used the phrase "9 to 5" as my guide:
twoyears <- seq.Date(as.Date("2010-01-01"), by='day', length.out=365*2)
twoworkyrs <- twoyears[isWeekday(twoyears, wday = 1:5)]
twoworkyrs[ 1:10]
# [1] "2010-01-01" "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" "2010-01-08"
# [7] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14"
workhours <- as.POSIXct( as.numeric(rep(twoworkyrs, each=9))*24*3600 + # weekdays
(9:17)*3600 , # working hours
origin="1970-01-01", tz="UTC")
#----- First two weeks ----------------
> workhours[1:90]
[1] "2010-01-01 09:00:00 UTC" "2010-01-01 10:00:00 UTC" "2010-01-01 11:00:00 UTC"
[4] "2010-01-01 12:00:00 UTC" "2010-01-01 13:00:00 UTC" "2010-01-01 14:00:00 UTC"
[7] "2010-01-01 15:00:00 UTC" "2010-01-01 16:00:00 UTC" "2010-01-01 17:00:00 UTC"
[10] "2010-01-04 09:00:00 UTC" "2010-01-04 10:00:00 UTC" "2010-01-04 11:00:00 UTC"
[13] "2010-01-04 12:00:00 UTC" "2010-01-04 13:00:00 UTC" "2010-01-04 14:00:00 UTC"
[16] "2010-01-04 15:00:00 UTC" "2010-01-04 16:00:00 UTC" "2010-01-04 17:00:00 UTC"
[19] "2010-01-05 09:00:00 UTC" "2010-01-05 10:00:00 UTC" "2010-01-05 11:00:00 UTC"
[22] "2010-01-05 12:00:00 UTC" "2010-01-05 13:00:00 UTC" "2010-01-05 14:00:00 UTC"
[25] "2010-01-05 15:00:00 UTC" "2010-01-05 16:00:00 UTC" "2010-01-05 17:00:00 UTC"
[snipped
I must admit that timezone conversions are one of my weakest suits.
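For the 8am to 5pm window asked about in the question, a base R sketch without timeDate is shown below. It assumes hourly steps, UTC, and that 17:00 marks the end of the day rather than the start of a working hour; the two-week span is arbitrary.

```r
# Two weeks of hourly timestamps starting on Friday 2010-01-01 (UTC assumed)
hours <- seq(as.POSIXct("2010-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = 24 * 14)

# POSIXlt exposes the weekday (0 = Sunday) and the hour as fields
lt <- as.POSIXlt(hours)

# Keep Monday to Friday, 08:00 up to but not including 17:00
work <- hours[lt$wday %in% 1:5 & lt$hour >= 8 & lt$hour < 17]
length(work)
```

Changing the last condition to lt$hour <= 17 would also keep the 17:00 stamp on each day.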
