I have about 50 data sets of datetimes at regular 5-minute intervals. POSIXt/lubridate functions convert my datetimes very nicely to a 24-hour format as required. But I would like to add another column in which a day runs from 6 am to 6 am (instead of the current midnight to midnight). I am trying to do this to capture activity after 12 AM as part of the current date rather than the next one.
I am currently trying to create a group every 288th row (there are 288 5-minute intervals in a day). But this creates a problem because my data sets don't necessarily start at the same time.
I do not want to create offsets because that tampers with the values corresponding to the time.
Are there any efficient ways around this problem? Thank you.
You can efficiently do it by first generating a sequence of date/times, then using cut to find the bin in which each value falls:
set.seed(2)
dat <- Sys.time() + sort(runif(10, min=0, max=5*24*60*60))
dat
# [1] "2017-07-29 15:43:10 PDT" "2017-07-29 20:23:12 PDT" "2017-07-29 22:24:22 PDT" "2017-07-31 08:22:57 PDT"
# [5] "2017-07-31 18:13:06 PDT" "2017-07-31 21:01:10 PDT" "2017-08-01 12:30:19 PDT" "2017-08-02 04:14:03 PDT"
# [9] "2017-08-02 17:26:14 PDT" "2017-08-02 17:28:52 PDT"
sixs <- seq(as.POSIXct("2017-07-29 06:00:00", tz = "UTC"), as.POSIXct("2017-08-03 06:00:00", tz = "UTC"), by = "day")
sixs
# [1] "2017-07-29 06:00:00 UTC" "2017-07-30 06:00:00 UTC" "2017-07-31 06:00:00 UTC" "2017-08-01 06:00:00 UTC"
# [5] "2017-08-02 06:00:00 UTC" "2017-08-03 06:00:00 UTC"
cut(dat, sixs, labels = FALSE)
# [1] 1 1 1 3 3 3 4 5 5 5
According to the help page (?seq.POSIXt), you might choose by="DSTday" instead.
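As a rough alternative sketch (not part of the answer above): a 6 am to 6 am day is just the calendar day shifted by six hours, so the label can be computed in a new column by subtracting six hours and taking the date. The original timestamps stay untouched. The sample timestamps and the name day6 are made up for illustration, and the fixed six-hour shift assumes a time zone without daylight saving transitions (UTC here).

```r
# Hypothetical 5-minute timestamps around midnight and 6 am (UTC assumed)
x <- as.POSIXct(c("2017-08-01 23:55:00",
                  "2017-08-02 00:05:00",
                  "2017-08-02 05:55:00",
                  "2017-08-02 06:00:00"), tz = "UTC")

# Shift a copy back by six hours and keep only the date part;
# x itself is not modified
day6 <- as.Date(x - 6 * 3600, tz = "UTC")

data.frame(x = x, day6 = day6)
```

Everything before 06:00 is labelled with the previous calendar date, and the label can then be used as a grouping column.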
Check out this question and the corresponding answer: How to manipulate the time part of a date column?
It illustrates a more robust solution, as it is independent of your data structure (e.g. repetition).
Following #meenaparam's solution:
Convert all date columns to dmy_hms format from lubridate package. Please explore other options like dmy_hm or ymd_hms etc, as per your specific need.
mutate(DATE = dmy_hms(DATE))
Now create a column to identify the data points that need to be modified. For example, your data points from 00:00:00 to 05:59:59 (hms) need to be part of the previous date.
mutate(DAY_PAST = case_when(hour(DATE) < 6 ~ "yup", TRUE ~ "nope"))
Now convert the day value of these "yup" dates to day(DATE)-1
mutate(NEW_DATE = case_when(DAY_PAST == "yup"
~ make_datetime(year(DATE-86400), month(DATE-86400), day = day(DATE-86400), hour = hour(DATE), min = minute(DATE), sec = second(DATE)),
TRUE ~ DATE))
Related
I have a list of time
df$Interval = cut(as.POSIXct(df$time1,format="%H:%M:%S",tz="UTC",origin="1970-01-01"),
breaks=as.POSIXct(c("2021-03-25 00:00:00","2021-03-25 07:59:59",
"2021-03-25 15:59:59","2021-03-25 23:59:59"), tz="UTC"),
labels=c("First Tour","Second Tour","Third Tour"))
I have a column of time
time1|
"05:06:00"
"23:10:00"
"04:05:00"
"22:12:00"
"09:06:12"
The script works, but I have to keep changing the date every day because
as.POSIXct(df$time1,format="%H:%M:%S",tz="UTC",origin="1970-01-01")
turns the time into
time1|
"2021-03-26 05:06:00 UTC"
"2021-03-26 23:10:00 UTC"
"2021-03-26 04:05:00 UTC"
"2021-03-26 22:12:00 UTC"
"2021-03-26 09:06:12 UTC"
Either of two solutions is fine: is there a way to cut into intervals using just the time in "%H:%M:%S" format, so that I don't have to worry about the date, or is there a way to attach a fixed date that stays the same regardless of the current date? For example:
time1|
"1990-01-01 05:06:00 UTC"
"1990-01-01 23:10:00 UTC"
"1990-01-01 04:05:00 UTC"
"1990-01-01 22:12:00 UTC"
"1990-01-01 09:06:12 UTC"
Ultimately my result should be
time1|time interval
"05:06:00" first tour
"23:10:00" third tour
"04:05:00" first tour
"22:12:00" third tour
"09:06:12" second tour
You are using date-time classes; switch to time-of-day objects with chron::times:
time1 <- c(
"05:06:00"
, "23:10:00"
, "04:05:00"
, "22:12:00"
, "09:06:12"
)
df <- data.frame(time1=time1)
df$time_interval <- cut(chron::times(df$time1),
breaks=chron::times(c(
"00:00:00"
,"07:59:59"
,"15:59:59"
,"23:59:59")
)
, labels=c(
"First Tour"
,"Second Tour"
,"Third Tour")
)
> df
time1 time_interval
1 05:06:00 First Tour
2 23:10:00 Third Tour
3 04:05:00 First Tour
4 22:12:00 Third Tour
5 09:06:12 Second Tour
HTH
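If adding a package is not an option, the same binning can be sketched in base R by turning each time into seconds since midnight and cutting on that number. This is only an illustration: the names parts and secs are made up, and the tours are assumed to change exactly at 08:00:00 and 16:00:00 (left-closed intervals), which differs by one second from the 07:59:59 and 15:59:59 breaks used above.

```r
time1 <- c("05:06:00", "23:10:00", "04:05:00", "22:12:00", "09:06:12")
df <- data.frame(time1 = time1)

# Split "HH:MM:SS" into a 3-column character matrix
parts <- do.call(rbind, strsplit(df$time1, ":", fixed = TRUE))

# Seconds since midnight; no date is involved at any point
secs <- as.numeric(parts[, 1]) * 3600 +
  as.numeric(parts[, 2]) * 60 +
  as.numeric(parts[, 3])

# Assumed tour boundaries: [00:00, 08:00), [08:00, 16:00), [16:00, 24:00)
df$time_interval <- cut(secs,
                        breaks = c(0, 8, 16, 24) * 3600,
                        labels = c("First Tour", "Second Tour", "Third Tour"),
                        right = FALSE)
df
```

Because only the time of day is used, nothing has to be changed from one day to the next.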
I have a csv file that consists of one column. The column presents the date of posting on a website. I want to plot a histogram to see how the number of posts varies over the years. The file contains the years (2012 to 2016) and consists of 11,000 rows.
sample of the file:
2 30/1/12 21:07
3 2/2/12 15:53
4 3/4/12 0:49
5 14/11/12 3:49
6 11/8/13 16:00
7 31/7/14 8:08
8 31/7/14 10:48
9 6/8/14 9:24
10 16/12/14 3:34
The data type is a data frame:
class(postsData)
[1] "data.frame"
I tried converting the data using the strptime function as below:
formatDate <- strptime(as.character(postsData$Date),format="“%d/%m/%y")
Then I plotted the histogram:
hist(formatDate,breaks=10,xlab="year")
Any tip or suggestion would be useful. Thank you,
Use lubridate::dmy_hm().
strptime() is overly complicated in my opinion compared to { lubridate }.
library(lubridate)
d <- c("30/1/12 21:07",
"2/2/12 15:53",
"3/4/12 0:49",
"14/11/12 3:49",
"11/8/13 16:00",
"31/7/14 8:08",
"31/7/14 10:48",
"6/8/14 9:24",
"16/12/14 3:34")
d2 <- dmy_hm(d)
d2
Returns:
[1] "2012-01-30 21:07:00 UTC"
[2] "2012-02-02 15:53:00 UTC"
[3] "2012-04-03 00:49:00 UTC"
[4] "2012-11-14 03:49:00 UTC"
[5] "2013-08-11 16:00:00 UTC"
[6] "2014-07-31 08:08:00 UTC"
[7] "2014-07-31 10:48:00 UTC"
[8] "2014-08-06 09:24:00 UTC"
[9] "2014-12-16 03:34:00 UTC"
As you can see, lubridate functions return POSIXct objects.
class(d2)
[1] "POSIXct" "POSIXt"
Next, you can use lubridate::year() to get the year of each POSIXct object returned by dmy_hm(), and plot that histogram.
hist(year(d2))
Here's one approach. I think your date conversion is fine but you need to count the number of dates that occur in each year then plot that count as a histogram.
library(tidyverse)
# generate some data
date.seq <- tibble(xdate = seq(from = lubridate::ymd_hms('2000-01-01 00:00:00'), to=lubridate::ymd_hms('2016-12-31 23:59:59'), length.out = 100))
date.seq %>%
mutate(xyear = lubridate::year(xdate)) %>% # add a column of years
group_by(xyear) %>%
summarise(date_count = length(xdate)) %>% # Count the number of dates that occur in each year
ggplot(aes(x = xyear, y = date_count)) +
geom_col(colour = 'black', fill = 'blue') # plot as a column graph
There's no problem with strptime()*; however, the format option is intended to specify how the input is formatted.
df1$date <- strptime(df1$date, format="%d/%m/%y %H:%M")
# [1] "2012-01-30 21:07:00 CET" "2012-02-02 15:53:00 CET"
# [3] "2012-04-03 00:49:00 CEST" "2012-11-14 03:49:00 CET"
# [5] "2013-08-11 16:00:00 CEST" "2014-07-31 08:08:00 CEST"
# [7] "2014-07-31 10:48:00 CEST" "2014-08-06 09:24:00 CEST"
# [9] "2014-12-16 03:34:00 CET"
What you probably want then is to convert the result to dates with as.Date():
formatDate <- as.Date(df1$date)
(format(df1$date, format="%F") shows the same dates, but as character strings, which hist() cannot plot)
and then
hist(formatDate, breaks=10, xlab="year")
* credits to #MikkoMarttila
Data
df1 <- structure(list(id = 2:10, date = c("30/1/12 21:07", "2/2/12 15:53",
"3/4/12 0:49", "14/11/12 3:49", "11/8/13 16:00", "31/7/14 8:08",
"31/7/14 10:48", "6/8/14 9:24", "16/12/14 3:34")), class = "data.frame", row.names = c(NA,
-9L))
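Since the goal is the number of posts per year, a base R sketch that counts the years directly may also help. It uses the nine sample values from the question. The names posted and posts_per_year are made up for illustration, and UTC is assumed only to keep the example independent of the local time zone.

```r
d <- c("30/1/12 21:07", "2/2/12 15:53", "3/4/12 0:49",
       "14/11/12 3:49", "11/8/13 16:00", "31/7/14 8:08",
       "31/7/14 10:48", "6/8/14 9:24", "16/12/14 3:34")

# Parse day/month/two-digit-year plus hours and minutes
posted <- as.POSIXct(d, format = "%d/%m/%y %H:%M", tz = "UTC")

# Extract the four-digit year and count the posts in each year
posts_per_year <- table(format(posted, "%Y"))
posts_per_year
```

Passing the result to barplot(posts_per_year) draws one bar per year.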
I have time series data at 10-minute intervals. When I try to convert it to a date-time type using df$Time1 <- dmy_hm(df$Time, tz="Asia/Calcutta"), it returns NA at the midnight (24 o'clock) timestamp.
I have tried both df$Time1 <- dmy_hm(df$Time, tz="Asia/Calcutta") and df$Time1 = as.POSIXct(df$Time, format="%d-%m-%y %H:%M"). Please guide me on this; I am clueless about what is happening at 02-07-16 00:00.
One option would be using parse_date_time from lubridate which can take multiple formats
library(lubridate)
parse_date_time(df$Time, c('dmy_HM', 'dmy'))
#[1] "2016-07-01 23:30:00 UTC" "2016-07-01 23:40:00 UTC"
#[3] "2016-07-01 23:50:00 UTC" "2016-07-02 00:00:00 UTC"
data
df <- data.frame(Time = c("01-07-16 23:30", "01-07-16 23:40", "01-07-16 23:50",
"02-07-16"))
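The NA appears because the midnight row carries no time part at all, as the sample data shows. A base R sketch of the same idea is to pad such rows with " 00:00" before parsing. The names has_time, padded and parsed are made up for illustration, and UTC is assumed here to match the output above.

```r
Time <- c("01-07-16 23:30", "01-07-16 23:40", "01-07-16 23:50", "02-07-16")

# Rows that already contain a time part have a colon in them
has_time <- grepl(":", Time, fixed = TRUE)

# Append midnight to the date-only rows so one format fits every row
padded <- ifelse(has_time, Time, paste(Time, "00:00"))

parsed <- as.POSIXct(padded, format = "%d-%m-%y %H:%M", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M")
```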
I would like to convert the following dates
dates <-c(1149318000L, 1151910000L, 1154588400L, 1157266800L, 1159858800L, 1162540800L)
into date and time format
I don't know the origin of the date but I know that
1146685218 = 2006/05/03 07:00:00
** Update 1 **
I have sorted the unformatted dates and replaced the sample above with a friendly sequence, but I have real dates. I was thinking of using the above key as the origin, but it does not seem to work.
Let
seconds_in_days <- 3600*24
(dates[2]-dates[1])/seconds_in_days
## [1] 30
If you know
1146685218 = 2006/05/03 07:00:00
then just make that the origin
dates <- c(1149318000L, 1151910000L, 1154588400L, 1157266800L, 1159858800L, 1162540800L)
orig.int <- 1146685218
orig.date <- as.POSIXct("2006/05/03 07:00:00", format="%Y/%m/%d %H:%M:%S")
as.POSIXct(dates-orig.int, origin=orig.date)
# [1] "2006-06-02 18:19:42 EDT" "2006-07-02 18:19:42 EDT" "2006-08-02 18:19:42 EDT"
# [4] "2006-09-02 18:19:42 EDT" "2006-10-02 18:19:42 EDT" "2006-11-02 18:19:42 EST"
This works assuming your "date" values are the number of seconds since a particular date/time, which is how POSIXt stores its date/time values.
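It may also be worth checking whether the values are plain Unix timestamps, i.e. seconds since 1970-01-01 00:00:00 UTC. The sketch below tries that assumption on the sample values; the name converted is made up, and whether this origin is the right one for the real data is not something the question confirms.

```r
dates <- c(1149318000L, 1151910000L, 1154588400L,
           1157266800L, 1159858800L, 1162540800L)

# Interpret the integers as seconds since the Unix epoch (an assumption)
converted <- as.POSIXct(dates, origin = "1970-01-01", tz = "UTC")
format(converted, "%Y-%m-%d %H:%M:%S")
```

Under this assumption the first five values fall on the third of consecutive months at 07:00:00 UTC and the last one at 08:00:00 UTC, which is the kind of regular pattern a correct origin tends to produce.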
From a data frame with timestamped rows (strptime results), what is the best method for aggregating statistics for intervals?
Intervals could be an hour, a day, etc.
There's the aggregate function, but that doesn't help with assigning each row to an interval. I'm planning on adding a column to the data frame that denotes interval and using that with aggregate, but if there's a better solution it'd be great to hear it.
Thanks for any pointers!
Example Data
Five rows with timestamps divided into 15-minute intervals starting at 03:00.
Interval 1
"2010-01-13 03:02:38 UTC"
"2010-01-13 03:08:14 UTC"
"2010-01-13 03:14:52 UTC"
Interval 2
"2010-01-13 03:20:42 UTC"
"2010-01-13 03:22:19 UTC"
Conclusion
Using a time series package such as xts should be the solution; however, I had no success with them and wound up using cut. As I presently only need to plot histograms, with rows grouped by interval, this was enough.
cut is used like so:
interv <- function(x, start, period, num.intervals) {
return(cut(x, as.POSIXlt(start)+0:num.intervals*period))
}
Standard functions to split vectors are cut and findInterval:
v <- as.POSIXct(c(
"2010-01-13 03:02:38 UTC",
"2010-01-13 03:08:14 UTC",
"2010-01-13 03:14:52 UTC",
"2010-01-13 03:20:42 UTC",
"2010-01-13 03:22:19 UTC"
))
# Your function returns a list:
interv(v, as.POSIXlt("2010-01-13 03:00:00 UTC"), 900)
# [[1]]
# [1] "2010-01-13 03:00:00"
# [[2]]
# [1] "2010-01-13 03:00:00"
# [[3]]
# [1] "2010-01-13 03:00:00"
# [[4]]
# [1] "2010-01-13 03:15:00 CET"
# [[5]]
# [1] "2010-01-13 03:15:00 CET"
# cut returns factor, you must provide proper breaks:
cut(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 2010-01-13 03:00:00 2010-01-13 03:00:00 2010-01-13 03:00:00
# [4] 2010-01-13 03:15:00 2010-01-13 03:15:00
# Levels: 2010-01-13 03:00:00 2010-01-13 03:15:00
# findInterval returns vector of interval id (breaks like in cut)
findInterval(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 1 1 1 2 2
For the record: cut has a method for the POSIXt type, but unfortunately there is no way to provide a start argument. The effect is:
cut(v,"15 min")
# [1] 2010-01-13 03:02:00 2010-01-13 03:02:00 2010-01-13 03:02:00
# [4] 2010-01-13 03:17:00 2010-01-13 03:17:00
# Levels: 2010-01-13 03:02:00 2010-01-13 03:17:00
As you can see, it starts at 03:02:00. You could mess with the labels of the output factor (convert labels to time, round somehow and convert back to character).
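One way to get bins that start on the quarter hour without editing factor labels is to floor the underlying number of seconds to a multiple of the period. This is only a sketch: the names period, bin_start and counts are made up, and UTC is assumed so that the epoch-based flooring lines up with whole clock hours.

```r
v <- as.POSIXct(c("2010-01-13 03:02:38",
                  "2010-01-13 03:08:14",
                  "2010-01-13 03:14:52",
                  "2010-01-13 03:20:42",
                  "2010-01-13 03:22:19"), tz = "UTC")

period <- 15 * 60  # interval length in seconds

# Round each timestamp down to the start of its 15-minute interval
bin_start <- as.POSIXct(floor(as.numeric(v) / period) * period,
                        origin = "1970-01-01", tz = "UTC")

# Number of rows per interval; bin_start can equally be used with aggregate()
counts <- table(format(bin_start, "%H:%M"))
counts
```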
Use a time series package. The xts package has functions designed specifically to do that. Or look at the aggregate and rollapply functions in the zoo package.
The rmetrics ebook has a useful discussion, including a performance comparison of the various packages: https://www.rmetrics.org/files/freepdf/TimeSeriesFAQ.pdf
Edit: Look at my answer to this question. Basically you need to truncate every timestamp into a specific interval and then do the aggregation using those new truncated timestamps as your grouping vector.
This is an interesting question; with the proliferation of the various time series packages and methods, there ought to be an approach for binning irregular time series other than by brute force that the OP suggests. Here is one "high-level" way to get the intervals that you can then use for aggregate et al, using a version of cut defined for chron objects.
require(chron)
require(timeSeries)
my.times <- "
2010-01-13 03:02:38 UTC
2010-01-13 03:08:14 UTC
2010-01-13 03:14:52 UTC
2010-01-13 03:20:42 UTC
2010-01-13 03:22:19 UTC
"
time.df <- read.delim(textConnection(my.times),header=FALSE,sep="\n",strip.white=FALSE)
time.seq <- seq(trunc(timeDate(time.df[1,1]),units="hours"),by=15*60,length=nrow(time.df))
intervals <- as.numeric(cut(as.chron(as.character(time.df$V1)),breaks=as.chron(as.character(time.seq))))
You get
intervals
[1] 1 1 1 2 2
which you can now append to the data frame and aggregate.
The coercion acrobatics above (from character to timeDate to character to chron) are a little unfortunate, so if there are cleaner solutions for binning irregular time data using xts or any of the other time series packages, I'd love to hear about them as well!
I am also curious to know what would be the most efficient approach for binning large high-frequency irregular time series, e.g. creating 1-minute volume bars on tick data for a very liquid stock.