I'm using an Excel dataset where the time values (MM:SS) come in as numbers, which I need to convert to POSIXct in R so I can do calculations on them.
Below is sample data of what I have and what I need to get:
dfOrig <- data.frame(StandarTime = c(615, 735, 615),
                     AchievedTime = c(794, 423, 544))
This is what I'm looking for:
dfCleaned <- data.frame(StandarTime = c("2017-08-25 10:15",
                                        "2017-08-25 12:15",
                                        "2017-08-25 10:15"),
                        AchievedTime = c("2017-08-25 13:14 PDT",
                                         "2017-08-25 7:03 PDT",
                                         "2017-08-25 9:04 PDT"))
I'm not sure how to best approach this problem.
I'm not sure what units the values are in, but in case they are seconds you can use:
> dfOrig$StandarTime <- ISOdate(2017, 8, 25, hour = 0) + dfOrig$StandarTime
> dfOrig$AchievedTime <- ISOdate(2017, 8, 25, hour = 0) + dfOrig$AchievedTime
> dfOrig
StandarTime AchievedTime
1 2017-08-25 00:10:15 2017-08-25 00:13:14
2 2017-08-25 00:12:15 2017-08-25 00:07:03
3 2017-08-25 00:10:15 2017-08-25 00:09:04
ISOdate(2017, 8, 25, hour = 0) sets the start time, to which you can then add a value in seconds. You can also specify a time zone via the tz argument.
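For example, a minimal sketch of the same conversion with an explicit time zone (the zone name here is my assumption, chosen to match the PDT labels in the desired output), formatted back to a readable string:
dfOrig$AchievedTime <- ISOdate(2017, 8, 25, hour = 0, tz = "America/Los_Angeles") + c(794, 423, 544)
format(dfOrig$AchievedTime, "%Y-%m-%d %H:%M %Z")  # e.g. "2017-08-25 00:13 PDT"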
I want to generate the same period across several days, e.g. from 09:30:00 to 16:00:00 every day. I know that
dates <- seq(as.POSIXct("2000-01-01 9:00", tz = 'UTC'), as.POSIXct("2000-04-9 16:00", tz = 'UTC'), by = 300)
gives me a time series observed every 5 minutes around the clock over the 100 days, but what I want is only 09:30:00 to 16:00:00 on each of those days.
Thanks in advance
Here is one way. We can create a date sequence covering every day, then build a sub-sequence of five-minute intervals within each day, and finally combine the list. final_seq is the final output.
date_seq <- seq(as.Date("2000-01-01"), as.Date("2000-04-09"), by = 1)
hour_seq <- lapply(date_seq, function(x){
  temp_date <- as.character(x)
  seq(as.POSIXct(paste(temp_date, "09:30"), tz = "UTC"),
      as.POSIXct(paste(temp_date, "16:00"), tz = "UTC"),
      by = 300)  # 300 seconds = 5 minutes
})
final_seq <- do.call("c", hour_seq)
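A quick sanity check on the combined sequence (assuming the code above has run): 100 days times 79 time points per day (09:30 to 16:00 in 5-minute steps) should give 7900 timestamps.
head(final_seq, 3)
length(final_seq)  # expected: 7900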
An option using tidyr::crossing() (which I love) and the lubridate package:
library(tidyr)
library(dplyr)
library(lubridate)

crossing(c1 = paste(dmy("01/01/2000") + 1:100, "09:30"),
         c2 = seq(0, 390, 5)) %>%
  mutate(time_series = ymd_hm(c1) + minutes(c2)) %>%
  pull(time_series)
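For comparison, a base-R sketch of the same crossing idea (my addition, not part of the original answer), using outer() to add every minute offset to every day's 09:30 anchor:
days <- as.Date("2000-01-01") + 1:100                  # same day range as above
anchors <- as.POSIXct(paste(days, "09:30"), tz = "UTC")
offsets <- seq(0, 390, by = 5) * 60                    # minute offsets as seconds
ts <- as.POSIXct(sort(as.vector(outer(as.numeric(anchors), offsets, `+`))),
                 origin = "1970-01-01", tz = "UTC")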
I can usually convert to POSIXct just fine, for instance:
as.POSIXct( "20:16:32", format = "%H:%M:%S" )
[1] "2017-06-23 20:16:32 EDT"
But once the time goes beyond 24h, it fails:
as.POSIXct( "24:16:32", format = "%H:%M:%S" )
[1] NA
This makes some sense, since 24:16:32 should rather be read as 00:16:32 on the following day.
Hour values of 24 and beyond are, however, widespread in public-transport timetables. I could of course replace every "24:" with "00:", but I am sure there is a more elegant way out.
Read the time string into a data frame dd and set next_day to 1 if the hour is 24 or more, or 0 if not. Subtract 24 from the hour when next_day is 1, and add one day's worth of seconds. Given that today is June 23, 2017, this works for hours between 0 and 47.
x <- "24:16:32" # test input
dd <- read.table(text = x, sep = ":", col.names = c("hh", "mm", "ss"))
next_day <- dd$hh >= 24
s <- sprintf("%s %0d:%0d:%0d", Sys.Date(), dd$hh - 24 * next_day, dd$mm, dd$ss)
as.POSIXct(s) + next_day * 24 * 60 * 60
## "2017-06-24 00:16:32 EDT"
I have been struggling with this for a while now:
I have a data frame that contains 5-minute measurements (for around 6 months) of different parameters. I want to aggregate them and get the mean of every parameter every 30 min. Here is a short example:
TIMESTAMP <- c("2015-12-31 0:30", "2015-12-31 0:35","2015-12-31 0:40", "2015-12-31 0:45", "2015-12-31 0:50", "2015-12-31 0:55", "2015-12-31 1:00", "2015-12-31 1:05", "2015-12-31 1:10", "2015-12-31 1:15", "2015-12-31 1:20", "2015-12-31 1:25", "2015-12-31 1:30")
value1 <- c(45, 50, 68, 78, 99, 100, 5, 9, 344, 10, 45, 68, 33)
mymet <- data.frame(TIMESTAMP, value1)
mymet$TIMESTAMP <- as.POSIXct(mymet$TIMESTAMP, format = "%Y-%m-%d %H:%M")
halfhour <- aggregate(mymet, list(TIME = cut(mymet$TIMESTAMP, breaks = "30 mins")),
                      mean, na.rm = TRUE)
What I want is the average over 00:35 through 1:00, labelled DATE-1:00AM; what I get instead is the average over 00:30 through 00:55, labelled DATE-12:30AM.
How can I change the function to give me the values that I want?
The trick (I think) is looking at when your first observation starts. If the first observation is at 00:35 and you do the 30-minute cut, the intervals follow the logic you want. As for the labels in the Breaks column, it is just a matter of adding 25 minutes to each one, and then you get what you want. Here is an example for six months of 2015:
require(lubridate)
require(dplyr)
TIMESTAMP <- seq(ymd_hm('2015-01-01 00:00'), ymd_hm('2015-06-01 23:55'), by = '5 min')
TIMESTAMP <- data.frame(obs = 1:length(TIMESTAMP), TS = TIMESTAMP)
TIMESTAMP <- TIMESTAMP[-(1:7), ]  # drop the first 7 rows so the series starts at 00:35
TIMESTAMP$Breaks <- cut(TIMESTAMP$TS, breaks = "30 mins")
TIMESTAMP$Breaks <- ymd_hms(as.character(TIMESTAMP$Breaks)) + (25 * 60)  # shift labels by 25 minutes
Averages <- TIMESTAMP %>% group_by(Breaks) %>% summarise(MeanObs = mean(obs, na.rm = TRUE))
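A quick check (assuming the code above has run): the first label should be 01:00, covering the 00:35 through 01:00 observations.
head(Averages, 2)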
If you get mymet constructed properly, you can cut TIMESTAMP into bins (which you can do with cut.POSIXt) so you can aggregate:
mymet$half_hour <- cut(mymet$TIMESTAMP, breaks = "30 min")
aggregate(value1 ~ half_hour, mymet, mean)
## half_hour value1
## 1 2015-12-31 00:30:00 73.33333
## 2 2015-12-31 01:00:00 80.16667
## 3 2015-12-31 01:30:00 33.00000
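If lubridate is available, floor_date() is a sketch of an alternative to cut() for the same binning (my suggestion, not part of the original answer):
library(lubridate)
mymet$half_hour <- floor_date(mymet$TIMESTAMP, "30 minutes")  # round each timestamp down to its half hour
aggregate(value1 ~ half_hour, mymet, mean)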
Data
mymet <- structure(list(TIMESTAMP = structure(c(1451539800, 1451540100,
1451540400, 1451540700, 1451541000, 1451541300, 1451541600, 1451541900,
1451542200, 1451542500, 1451542800, 1451543100, 1451543400), class = c("POSIXct",
"POSIXt"), tzone = ""), value1 = c(45, 50, 68, 78, 99, 100, 5,
9, 344, 10, 45, 68, 33)), .Names = c("TIMESTAMP", "value1"), row.names = c(NA,
-13L), class = "data.frame")
I have a data.frame that contains two date columns, one for date of birth (DOB) for an individual, and a reference point in time (Snapshot.Date), let's say it's the date we last saw that individual. There are other columns (omitted), so I'd ideally like the results to be added as a column to my existing data.frame.
I would like to calculate how many months (as a continuous value) lie between the individual's last birthday (relative to the Snapshot.Date) and the Snapshot.Date.
I've tried a plyr solution and a base sapply solution, and both are slower than I expected (and I need to process one million rows in my 'real' data.frame).
First, here is a test dataset: 20 original records (with the 'special' case of Feb 29th, which only exists in a leap year).
data.test = structure(list(Snapshot.Date = structure(c(1433030400, 1396224000,
1375228800, 1396224000, 1383177600, 1362009600, 1367280000, 1369958400,
1346371200, 1348963200, 1435622400, 1435622400, 1435622400, 1435622400,
1435622400, 1435622400, 1435622400, 1435622400, 1435622400, 1346371200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), DOB = structure(c(-268790400,
-155692800, -955065600, -551232000, -149644800, -774230400, -485395200,
-17625600, -131932800, -387244800, 545961600, 18489600, -230515200,
441676800, -32745600, 775180800, 713491200, 483235200, 114307200,
-815443200), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Snapshot.Date",
"DOB"), row.names = c(32806L, 21294L, 14880L, 21730L, 17525L,
8516L, 11068L, 11751L, 2564L, 3832L, 802276L, 1031697L, 129222L,
588224L, 1093247L, 878037L, 370736L, 709108L, 861908L, 2199L), class = "data.frame")
And the function for calculating months (I'm sure this can be improved too).
library(lubridate)

months_since_last_birthday = function(CurrentDate, DateOfBirth)
{
  last_birthday = DateOfBirth
  # this birthday only occurs once every four years; reset it to the 28th
  if (month(last_birthday) == 2 & day(last_birthday) == 29)
  {
    day(last_birthday) = 28
  }
  year(last_birthday) = year(CurrentDate)
  if (last_birthday > CurrentDate)
  {
    last_birthday = last_birthday - years(1)  # last year's birthday is the most recent occurrence
  }
  # interval() replaces the now-deprecated new_interval()
  return(as.period(interval(last_birthday, CurrentDate)) / months(1))
}
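For example, calling it on the first record (assuming lubridate is loaded) should reproduce the first row of the desired output that follows, about 11.16 months; the exact decimals depend on how your lubridate version approximates a month:
months_since_last_birthday(ymd("2015-05-31", tz = "UTC"), ymd("1961-06-26", tz = "UTC"))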
For the base 20 records, here is the desired output:
Snapshot.Date DOB Months.Since.Birthday
32806 2015-05-31 1961-06-26 11.1643836
21294 2014-03-31 1965-01-25 2.1972603
14880 2013-07-31 1939-09-27 10.1315068
21730 2014-03-31 1952-07-14 8.5589041
17525 2013-10-31 1965-04-05 6.8547945
8516 2013-02-28 1945-06-20 8.2630137
11068 2013-04-30 1954-08-15 8.4931507
11751 2013-05-31 1969-06-11 11.6575342
2564 2012-08-31 1965-10-27 10.1315068
3832 2012-09-30 1957-09-24 0.1972603
802276 2015-06-30 1987-04-21 2.2958904
1031697 2015-06-30 1970-08-03 10.8876712
129222 2015-06-30 1962-09-12 9.5917808
588224 2015-06-30 1983-12-31 5.9863014
1093247 2015-06-30 1968-12-18 6.3945205
878037 2015-06-30 1994-07-26 11.1315068
370736 2015-06-30 1992-08-11 10.6246575
709108 2015-06-30 1985-04-25 2.1643836
861908 2015-06-30 1973-08-16 10.4602740
2199 2012-08-31 1944-02-29 6.0986301
Scaling up the dataset for benchmarking:
# Make 5000 records total for benchmarking, didn't replicate Feb 29th
# since it is a very rare case in the data
set.seed(1)
data.test = rbind(data.test, data.test[sample(1:19, size = 4980, replace = TRUE),])
library(plyr)  # for adply()
start.time = Sys.time()
res = suppressMessages(adply(data.test, 1, transform, Months.Since.Birthday = months_since_last_birthday(Snapshot.Date, DOB)))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.793945 mins
start.time = Sys.time()
data.test$Months.Since.Birthday = suppressMessages(sapply(1:5000, function(x){return(months_since_last_birthday(data.test$Snapshot.Date[x], data.test$DOB[x]))}))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.743053 mins
Am I doing something seriously wrong? Does this seem really slow to you?
Any feedback is welcome!
Unless I'm missing something obvious, there are a number of built-in ways of working with time data in R, notably base::difftime, which may save you some trouble.
Taking your above dataset data.test:
data.test$dif <- round(as.vector(as.double(difftime(strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"), strptime(data.test$DOB, format = "%Y-%m-%d"), units = "days"))) %% 365, 1)
or, laid out more readably (rearranged slightly so that it still parses if you copy-paste it):
data.test$dif <-
  round(
    as.vector(
      as.double(
        difftime(
          strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"),
          strptime(data.test$DOB, format = "%Y-%m-%d"),
          units = "days"
        )
      )
    ) %% 365,
    1
  )
The above uses the difftime function to find the difference between the two dates, parsed with the given format (format = "%Y-%m-%d"), in days, then takes the remainder modulo 365 to get the number of days since the last birthday. I personally think this is a better measure than months, because a two-month gap from July to August covers a different number of days than one from January to February.
Note: the above does not account for leap years. You could look up a list of leap years and add one day to the snapshot date, or subtract one day from the birthday, of each individual who lived through a leap year to get an exact count.
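A rough hedge against that leap-year drift (my variation, not part of the answer) is to use the average year length in the modulo:
data.test$dif2 <- round(as.double(difftime(data.test$Snapshot.Date, data.test$DOB,
                                           units = "days")) %% 365.25, 1)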
Problem creating data.table with date-time column:
> mdt <- data.table(id=1:3, d=strptime(c("06:02:36", "06:02:48", "07:03:12"), "%H:%M:%S"))
> class(mdt)
[1] "data.table" "data.frame"
> print(mdt)
Error in `rownames<-`(`*tmp*`, value = paste(format(rn, right = TRUE), :
length of 'dimnames' [1] not equal to array extent
Enter a frame number, or 0 to exit
1: print(list(id = 1:3, d = list(sec = c(36, 48, 12), min = c(2, 2, 3), hour = c(6, 6, 7), mday = c(31,
2: print.data.table(list(id = 1:3, d = list(sec = c(36, 48, 12), min = c(2, 2, 3), hour = c(6, 6, 7), m
3: `rownames<-`(`*tmp*`, value = paste(format(rn, right = TRUE), ":", sep = ""))
Creating it as a data.frame and then converting to a data.table works:
> mdf <- data.frame(id=1:3, d=strptime(c("06:02:36", "06:02:48", "07:03:12"), "%H:%M:%S"))
> print(mdf)
id d
1 1 2014-01-31 06:02:36
2 2 2014-01-31 06:02:48
3 3 2014-01-31 07:03:12
> mdt <- as.data.table(mdf)
> print(mdt)
id d
1: 1 2014-01-31 06:02:36
2: 2 2014-01-31 06:02:48
3: 3 2014-01-31 07:03:12
> class(mdt)
[1] "data.table" "data.frame"
Am I missing anything, or is this a bug? If a bug, where do I report it?
Note that I use R version 3.0.0, and I see some warnings about packages built under version 3.0.2. Could that be the problem? Should I upgrade R itself? Everything else I do seems to work, though.
Formatting a response from Blue Magister's comment (thanks so much): data.table does not support the POSIXlt data type, for performance reasons; see cast string to IDateTime, suggested as a possible duplicate.
So the way to go is to cast the time as ITime (a type provided by data.table), or the date-time (or date only) as POSIXct, depending on whether the date information is important:
> mdt <- data.table(id=1:3, d=as.ITime(strptime(c("06:02:36", "06:02:48", "07:03:12"), "%H:%M:%S")))
> print(mdt)
id d
1: 1 06:02:36
2: 2 06:02:48
3: 3 07:03:12
> mdt <- data.table(id=1:3, d=as.POSIXct(strptime(c("06:02:36", "06:02:48", "07:03:12"), "%H:%M:%S")))
> print(mdt)
id d
1: 1 2014-01-31 06:02:36
2: 2 2014-01-31 06:02:48
3: 3 2014-01-31 07:03:12
As an extra note, in case someone can benefit from it: I wanted to build a date-time from input data with the date and time in separate fields.
I found it useful to learn (see ?ITime) that one can add an ITime to a POSIXct date-time and get a POSIXct date-time back:
> mdt <- as.POSIXct("2014-01-31") + as.ITime("06:02:36")
> print(mdt)
[1] "2014-01-31 06:02:36 EST"
> class(mdt)
[1] "POSIXct" "POSIXt"