I have a column containing a bunch of timestamps such as 2:03:45, which represent 2 hours, 3 minutes and 45 seconds (not 2:03PM). I'm wondering how I can go about turning 2:03:45 into a minute value, which would be 123 + 45/60 minutes.
I have used strsplit(x$time, ":") so that now it is separated. Is there a way I can run a for loop through the rows, so that it takes the hour * 60 + minutes + seconds/60? Thanks.
Install the chron package:
install.packages("chron")
Load the library:
library(chron)
I'll create a timestamp character string to play with:
d <- c("2:20:34", "3:12:22")
Read the data into a date-time class:
df <- strptime(d, format = "%H:%M:%S")
> minutes(df)
[1] 20 12
> hours(df)
[1] 2 3
> seconds(df)
[1] 34 22
So all we need to do is compute the minutes from these three:
timeinminutes <- 60 * hours(df) + minutes(df) + seconds(df) / 60
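For reference, the same computation works in base R with the strsplit approach from the question, no explicit loop needed; a minimal sketch, assuming x$time holds strings like "2:03:45":
parts <- strsplit(x$time, ":")  # each element: c(hours, minutes, seconds)
x$minutes <- sapply(parts, function(p) {
  p <- as.numeric(p)
  p[1] * 60 + p[2] + p[3] / 60  # hours * 60 + minutes + seconds / 60
})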
Related
I am doing some date/time manipulation and experiencing explicable, but unpleasant, round-tripping problems when converting date -> time -> date. I have temporarily overcome this problem by rounding at appropriate points, but I wonder if there are best practices for date handling that would be cleaner. I'm using a mix of base-R and lubridate functions.
tl;dr is there a good, simple way to convert from decimal date (YYYY.fff) to the Date class (and back) without going through POSIXt and incurring round-off (and potentially time-zone) complications?
Start with a few days from 1918, as separate year/month/day columns (not a critical part of my problem, but it's where my pipeline happens to start):
library(lubridate)
dd <- data.frame(year=1918,month=9,day=1:12)
Convert year/month/day -> date -> time:
dd <- transform(dd,
time=decimal_date(make_date(year, month, day)))
The successive differences in the resulting time vector are not exactly 1 because of roundoff: this is understandable but leads to problems down the road.
table(diff(dd$time)*365)
## 0.999999999985448 1.00000000006844
## 9 2
Now suppose I convert back to a date: the dates are slightly before or after midnight (off by <1 second in either direction):
d2 <- lubridate::date_decimal(dd$time)
# [1] "1918-09-01 00:00:00 UTC" "1918-09-02 00:00:00 UTC"
# [3] "1918-09-03 00:00:00 UTC" "1918-09-03 23:59:59 UTC"
# [5] "1918-09-04 23:59:59 UTC" "1918-09-05 23:59:59 UTC"
# [7] "1918-09-07 00:00:00 UTC" "1918-09-08 00:00:00 UTC"
# [9] "1918-09-09 00:00:00 UTC" "1918-09-09 23:59:59 UTC"
# [11] "1918-09-10 23:59:59 UTC" "1918-09-12 00:00:00 UTC"
If I now want dates (rather than POSIXct objects) I can use as.Date(), but to my dismay as.Date() truncates rather than rounding ...
tt <- as.Date(d2)
## [1] "1918-09-01" "1918-09-02" "1918-09-03" "1918-09-03" "1918-09-04"
## [6] "1918-09-05" "1918-09-07" "1918-09-08" "1918-09-09" "1918-09-09"
##[11] "1918-09-10" "1918-09-12"
So the differences are now 0/1/2 days:
table(diff(tt))
# 0 1 2
# 2 7 2
I can fix this by rounding first:
table(diff(as.Date(round(d2))))
## 1
## 11
but I wonder if there is a better way (e.g. keeping POSIXct out of my pipeline and staying with dates ...)
As suggested by this R-help desk article from 2004 by Grothendieck and Petzoldt:
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
The extensive table in this article shows how to translate among Date, chron, and POSIXct, but doesn't include decimal time as one of the candidates ...
It seems like it would be best to avoid converting back from decimal time if at all possible.
When converting from date to decimal date, one also needs to account for time. Since Date does not have a specific time associated with it, decimal_date inherently assumes it to be 00:00:00.
However, if we are concerned only with the date (and not the time), we could assume the time to be anything. Arguably, middle of the day (12:00:00) is as good as the beginning of the day (00:00:00). This would make the conversion back to Date more reliable as we are not at the midnight mark and a few seconds off does not affect the output. One of the ways to do this would be to add 12*60*60/(365*24*60*60) to dd$time
dd$time2 = dd$time + 12*60*60/(365*24*60*60)
data.frame(dd[1:3],
"00:00:00" = as.Date(date_decimal(dd$time)),
"12:00:00" = as.Date(date_decimal(dd$time2)),
check.names = FALSE)
# year month day 00:00:00 12:00:00
#1 1918 9 1 1918-09-01 1918-09-01
#2 1918 9 2 1918-09-02 1918-09-02
#3 1918 9 3 1918-09-03 1918-09-03
#4 1918 9 4 1918-09-03 1918-09-04
#5 1918 9 5 1918-09-04 1918-09-05
#6 1918 9 6 1918-09-05 1918-09-06
#7 1918 9 7 1918-09-07 1918-09-07
#8 1918 9 8 1918-09-08 1918-09-08
#9 1918 9 9 1918-09-09 1918-09-09
#10 1918 9 10 1918-09-09 1918-09-10
#11 1918 9 11 1918-09-10 1918-09-11
#12 1918 9 12 1918-09-12 1918-09-12
It should be noted, however, that the value of decimal time obtained in this way will be different.
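The offset itself is tiny: half a day expressed as a fraction of a (365-day) year, as in the answer's own expression:
12*60*60/(365*24*60*60)
# [1] 0.001369863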
lubridate::decimal_date() is returning a numeric. If I understand you correctly, the question is how to convert that numeric into Date and have it round appropriately without bouncing through POSIXct.
as.Date(1L, origin = '1970-01-01') shows us that we can provide as.Date with days since some specified origin and convert immediately to the Date type. Knowing this, we can skip the year part entirely and set it as origin. Then we can convert our decimal dates to days:
as.Date((dd$time - trunc(dd$time)) * 365, origin = "1918-01-01")
So, a function like this might do the trick (at least for years without leap days):
date_decimal2 <- function(decimal_date) {
  years <- trunc(decimal_date)
  origins <- paste0(years, "-01-01")
  # round to whole days so that float error cannot truncate to the previous day
  # c.f. https://stackoverflow.com/questions/14449166/dates-with-lapply-and-sapply
  do.call(c, mapply(as.Date.numeric, x = round((decimal_date - years) * 365),
                    origin = origins, SIMPLIFY = FALSE))
}
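As a quick round-trip check (assuming dd is the data frame built above), the function should now recover evenly spaced dates:
table(diff(date_decimal2(dd$time)))
## expected: all 11 differences equal to 1 day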
Side note: I admit I went down a bit of a rabbit hole trying to move origin around to deal with the pre-1970 dates. I found that the further origin shifted from the target date, the weirder the results got (and not in ways that seemed to be easily explained by leap days). Since origin is flexible, I decided to target it right on top of the target values. For leap days, seconds, and whatever other weirdness time has in store for us, on your own head be it. =)
I have a model which predicts the duration of certain events, and measures of durations for those events. I then want to compute the difference between Predicted and Measured, the mean difference and the RMSE. I'm able to do it, but the formatting is really awkward and not what I expected:
database <- data.frame(Predicted = c(strptime(c("4:00", "3:35", "3:38"), format = "%H:%M")),
Measured = c(strptime(c("3:39", "3:40", "3:53"), format = "%H:%M")))
> database
            Predicted            Measured
1 2016-11-28 04:00:00 2016-11-28 03:39:00
2 2016-11-28 03:35:00 2016-11-28 03:40:00
3 2016-11-28 03:38:00 2016-11-28 03:53:00
This is the first weirdness: why does R show me a date as well as a time, even though I clearly specified a time-only format (%H:%M) and there was no date in my data to start with? It gets weirder:
database$Error <- with(database, Predicted-Measured)
database$Mean_Error <- with(database, mean(Predicted-Measured))
database$RMSE <- with(database, sqrt(mean(as.numeric(Predicted-Measured)^2)))
> database
Predicted Measured Error Mean_Error RMSE
1 2016-11-28 04:00:00 2016-11-28 03:39:00 21 mins 0.3333333 15.17674
2 2016-11-28 03:35:00 2016-11-28 03:40:00 -5 mins 0.3333333 15.17674
3 2016-11-28 03:38:00 2016-11-28 03:53:00 -15 mins 0.3333333 15.17674
Why is the variable Error expressed in minutes? For Error it's not a bad choice, but it becomes quite hard to read for Mean_Error. For RMSE it's even worse, but this could be due to the as.numeric function: if I remove it, R complains that '^' not defined for "difftime" objects. My questions are:
Is it possible to show the first 2 columns (Predicted and Measured) in the %H:%M format?
For the other 3 columns (Error, Mean_Error and RMSE), I would like to compare a %M:%S format and a seconds-only format, and choose between the two. Is that possible?
EDIT: just to be more clear, my goal is to insert observations of time intervals into a dataframe and compute a vector of time interval differences. Then, compute some statistics for that vector: mean, RMSE, etc. I know I could just enter the time observations in seconds, but that doesn't look very good: it's difficult to tell that 13200 seconds are 3 hours and 40 minutes. Thus I would like to be able to store the time intervals in the %H:%M format, but then be able to manipulate them algebraically and show the results in a format of my choosing. Is that possible?
We can use difftime to specify the units for the difference in time. The output of difftime is an object of class difftime. When this difftime object is coerced to numeric using as.numeric, we can change these units (see the examples in ?difftime):
## Note we don't convert to date-time because we just want %H:%M
database <- data.frame(Predicted = c("4:00", "3:35", "3:38"),
Measured = c("3:39", "3:40", "3:53"))
## We now convert to date-time and use difftime to compute difference in minutes
database$Error <- with(database, difftime(strptime(Predicted, format = "%H:%M"),
                                          strptime(Measured, format = "%H:%M"),
                                          units = "mins"))
## Use as.numeric to change units to seconds
database$Mean_Error <- with(database, mean(as.numeric(Error,units="secs")))
database$RMSE <- with(database, sqrt(mean(as.numeric(Error,units="secs")^2)))
## Predicted Measured Error Mean_Error RMSE
##1 4:00 3:39 21 mins 20 910.6042
##2 3:35 3:40 -5 mins 20 910.6042
##3 3:38 3:53 -15 mins 20 910.6042
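For the second question (showing the error columns as minutes and seconds rather than plain numbers), there is no built-in %M:%S output format for difftime, but a small helper can render seconds that way; a sketch, where fmt_ms is a hypothetical helper name:
fmt_ms <- function(secs) {
  secs <- round(secs)  # whole seconds
  sprintf("%s%02d:%02d", ifelse(secs < 0, "-", ""), abs(secs) %/% 60, abs(secs) %% 60)
}
fmt_ms(as.numeric(database$Error, units = "secs"))
## [1] "21:00"  "-05:00" "-15:00"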
So this is the question.
Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes
17 16 20 24 22 15 21 15 17 22
Suppose that the ‘24’ was a mistake and should have been 18. Write code that fixes this, i.e. changes ‘24’ to ‘18’. Then compute the new mean and standard deviation of the commute times.
Write code which counts the number of instances where the commute time is at least 20 minutes. Then convert this into a percentage.
This is my solution for Q3, with the result I got when I ran the code. Can anybody tell me if it is correct?
commute <- c(17,16,20,24,22,15,21,15,17,22)
commute[commute==24] <- 18
n <- length(commute)
sum((commute>=20)/n)
#[1] 0.4
To complete user20650's answer, you could use a formatted string to display the outcome as a percentage, as requested:
sprintf("%0.2f%%",100* mean(commute>=20))
[1] "40.00%"
I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character, representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the same judge within the 30-day window ending on the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. So for each record I'm taking a subset of the data for that particular judge and then subsetting the cases decided in the 30 day window, and then assigning the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
for (i in 1:length(x$idnumber)) {
  e <- x$compdate[i]
  f <- e - 29
  a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]
  b <- a[a$compdate <= e & a$compdate >= f, ]
  x$caseload[i] <- length(b$idnumber)
}
It is working, but it is taking extremely long to complete. How can I optimize this or do it more simply? Sorry, I'm very new to R and to programming -- I'm a law professor trying to analyze court data.... Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload <- unlist(by(x$compdate, x$judge, rolling.window))
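The one-liner works because, for a sorted vector y, findInterval(y - window, y) counts how many cases fall on or before each case's date minus 30, so subtracting that from the running index leaves the number of cases inside the window. A toy check with dates as integers, not from the original answer:
y <- c(1, 5, 20, 34, 40)  # sorted completion dates
seq_along(y) - findInterval(y - 30, y)
# [1] 1 2 3 3 3   # e.g. for 34, the cases at 5, 20 and 34 fall in the window (4, 34]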
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes' simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)
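Note that wload here is one value per judge-day rather than per case; to attach it back to the individual cases, one could merge on judge and completion date (a sketch, not part of the original answer):
x2 <- merge(DT, ldt[, list(judge, compdate, wload)], by = c("judge", "compdate"))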
Let's say I have a time series with daily data (business days), and I would like to organize the data by business weeks (Monday-Friday), in a similar fashion as this webpage from the EIA on futures prices of crude oil:
http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RCLC1&f=D
As you can see the prices are nicely organized by weeks in this webpage.
Is there any function in R that could organize the data in a similar fashion?
You can obtain the data in .xls format at:
http://www.eia.gov/dnav/pet/hist_xls/RCLC1d.xls
What I would like to do is to assign a week number to each daily observation, something like this (look at the weeks column):
Date Price weeks day
1983-04-04 29.44 1 Monday
1983-04-05 29.71 1 Tuesday
1983-04-06 29.92 1 Wednesday
1983-04-07 30.17 1 Thursday
1983-04-08 30.38 1 Friday
1983-04-11 30.26 2 Monday
...
...
So far I have used the week function from the lubridate package, but it is not working well: it seems that once a year hits its 53rd week, the function fails to start the week numbering of the following year properly.
I have been trying to stay away from rep- or seq-by-5/7 kinds of solutions, since there may be some observations that I need to filter from the data later on. So rather than a solution that depends on the particular vector of my data, I would prefer something more general that depends on the date class, i.e. POSIXct, xts or zoo.
Any hints would be greatly appreciated.
Wouldn't this work?:
as.POSIXlt(x)$yday %/% 7
I realize that it does have part of what you wanted to avoid, but it does draw its starting point from a recognized class. For your data (noting that I read it in with colClasses=c("Date", "numeric", "numeric", "character")):
> 1 + as.POSIXlt(dat$Date)$yday %/% 7
[1] 14 14 14 14 14 15
If you want to replicate those interval labels, try adding multiples of 7 to any Monday and Friday:
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(39)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(39)*7,
sep="")
#[1] "1984-01-02 to 1984-01-06" # The first new year change
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(39+52)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(39+52)*7,
sep="")
#[1] "1984-12-31 to 1985-01-04" # The second new year change
Here's a function that will accept an integer vector:
from8Apr83dts <- function(numwks) {
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(numwks)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(numwks)*7,
sep="")
}
# Usage
from8Apr83dts(39:40)
#[1] "1984-01-02 to 1984-01-06" "1984-01-09 to 1984-01-13"